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Abstract In this paper, we present and analyze a new set 
of low -rank recovery algorithms for linear inverse problems 
within the class of hard thresholding methods. We provide 
strategies on how to set up these algorithms via basic in- 
gredients for different configurations to achieve complex- 
ity vs. accuracy tradeoffs. Moreover, we study acceleration 
schemes via memory-based techniques and randomized, e- 
approximate matrix projections to decrease the computa- 
tional costs in the recovery process. For most of the configu- 
rations, we present theoretical analysis that guarantees con- 
vergence under mild problem conditions. Simulation results 
demonstrate notable performance improvements as compared 
to state-of-the-art algorithms both in terms of reconstruction 
accuracy and computational complexity. 

Keywords Affine rank minimization • hard thresholding • 
e-approximation schemes • randomized algorithms. 



1 Introduction 

In this work, we consider the general affine rank minimiza- 
tion (ARM) problem, described as follows: 



^ The arm Problem: Assume X* e 



is a rank-k 



S^ matrix of interest (k ^ min{m, n}) and let A. : ]R™>*" — > 
?-H W be a known linear operator. Given a set of observations 
^ as y ^ AJC* + e E W, we desire to recover X*from y in 
a scalable and robust manner 

The challenge in this problem is to recover the true low- 
rank matrix in subsampled settings where p <C to • n. In such 

A. Kyrillidis 

Laboratory for Information and Inference Systems, Ecole Polytech- 

nique Federate de Lausanne 

Tel.: +41 21 69 31154 

E-mail: anastasios.kyrillidis@epfl.ch 

V. Cevher 

Laboratory for Information and Inference Systems, Ecole Polytech- 

nique Federate de Lausanne 

TeL;+4121 69 31101 

E-mail: volkan.cevher@epfl.ch 



cases, we typically exploit the prior information that X* is 
low-rank and thus, we are interested in finding a matrix X 
of rank at most k that minimizes the data error f{X) 

\\y-AX\^ 



1 2 as follows: 



fiX) 
subject to rank(X) < k. 



nummize 



(1) 



The ARM problem appears in many applications; low di- 
mensional embedding 1 1 ], matrix completion ||2), image com- 
pression |3 1, function learning |4, 5| just to name a few. We 
present below important ARM problem cases, as character- 
ized by the nature of the linear operator A.. 

General linear maps: In many ARM problem cases, A 
or A.* has a dense range, satisfying specific incoherence or 
restricted isometry properties (discussed later in the paper); 
here. A.* is the adjoint operator of A. In Quantum Tomogra- 
phy, ||6) studies the Pauli operator, a compressive linear map 
A that consists of the kronecker product of 2 x 2 matrices 
and obeys restricted isometry properties, defined later in the 
paper. Furthermore, recent developments indicate connec- 
tions of ridge function learning |4 7 1 and phase retrieval ||8j 
with the ARM problem where ^ is a Bernoulli and a Fourier 
operator, respectively. 

Matrix Completion (MC): Let il be the set of ordered 
pairs that represent the coordinates of the observable entries 
in X* . Then, the set of observations satisfy y = AnX* +e 
where An defines a linear mask over the observable entries 
il. To solve the MC problem, a potential criterion is given 
by (fill |2|. As a motivating example, consider the famous 
Netflix problem |9 1, a recommender system problem where 
users' movie preferences are inferred by a limited subset of 
entries in a database. 

Principal Component Analysis: In Principal Compo- 
nent Analysis (PCA), we are interested in identifying a low 
rank subspace that best explains the data in the Euclidean 
sense from the observations y = AX* where A : E™^" — >■ 
MP is an identity linear map that stacks the columns of the 
matrix X* into a single column vector with p = m ■ n. 
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We observe that the PCA problem falls under the ARM cri- 
terion in (fill. While (fTl) is generally NP-hard to solve op- 
timally, PCA can be solved in polynomial time using the 
truncated Singular Value Decomposition (SVD) of A.*y. As 
an extension to the PCA setting, [ lOJ considers the Robust 
PCA problem where y is further corrupted by gross sparse 
noise. We extend the framework proposed in this paper for 
the RPCA case and its generalizations in 1 1 1 1. 

For the rest of the paper, we consider only the low rank 
estimation case in ([T]i. As running test cases to support our 
claims, we consider the MC setting as well as the general 
ARM setting where A. is constituted by permuted subsam- 
pled noiselets | ,12J . 



1 . 1 Two camps of recovery algorithms 



Convex relaxations: In |13J, the authors study the nu- 
clear norm ||X||* :— X]i=i '^i ^^ ^ convex surrogate of 
rank(X) operator so that we can leverage convex optimiza- 
tion approaches, such as interior-point methods — here, cr^ 
denotes the i-th singular value of X. Under basic incoher- 
ence properties of the sensing linear mapping A^ p3) pro- 
vides provable guarantees for unique low rank matrix recov- 
ery using the nuclear norm. 

Once (fT) is relaxed to a convex problem, decades of 
knowledge on convex analysis and optimization can be lever- 
aged. Interior point methods find a solution with fixed preci- 
sion in polynomial time but their complexity might be pro- 
hibitive even for moderate-sized problems [14 , 15 1. More 
suitable for large-scale data analysis, first-order methods con- 
stitute low-complexity alternatives but most of them intro- 
duce complexity vs. accuracy tradeoffs 1 16- 19|. 

Non-convex approaches: In contrast to the convex re- 
laxation approaches, iterative greedy algorithms maintain 
the nonconvex nature of ([T]l. Unfortunately, solving ([T} opti- 
mally is in general NP-hard |20|. Due to this computational 
intractability, the algorithms in this class greedily refine a 
rank-A; solution using only "local" information available at 
the current iteration |l2T1423l . 



block coordinate descent). Hi) memory exploitation, iv) ac- 
tive low-rank subspace tracking and, v) low-rank matrix ap- 
proximations (described next). We highlight the impact of 
these key pieces on the convergence rate and signal recon- 
struction performance and provide optimal and/or efficient 
strategies on how to set up these ingredients under different 
problem conditions. 

Low-rank matrix approximations in hard threshold- 
ing methods: In |25|, the authors show that the solution ef- 
ficiency can be significantly improved by e-approximation 
algorithms. Based on similar ideas, we analyze the impact 
of e-approximate low rank-revealing schemes in the pro- 
posed algorithms with well-characterized time and space co- 
mplexities. Moreover, we provide extensive analysis to prove 
convergence using e-approximate low-rank projections. 

Hard thresholding-based framework with improved 
convergence conditions: We study hard thresholding vari- 
ants that provide salient computational tradeoffs for the class 
of greedy methods on low -rank matrix recovery. These meth- 
ods, as they iterate, exploit the non-convex scaffold of low 
rank subspaces on which the approximation problem resides. 
Using simple analysis tools, we derive improved conditions 
that guarantee convergence, compared to state-of-the-art ap- 
proaches. 

The organization of the paper is as follows. In Section 
l2j we set up the notation and provide some definitions and 
properties, essential for the rest of the paper. In Section [3] 
we describe the basic algorithmic frameworks in a nutshell, 
while in Section |4] we provide important "ingredients" for 
the class of hard-thresholding methods; detailed convergence 
analysis proofs are provided in Section |5] The complexity 
analysis of the proposed algorithms is provided in Section 
l6] We study two acceleration schemes in Sections l7] and [8] 
based on memory utilization and e-approximate low-rank 
projections, respectively. We further improve convergence 
speed by exploiting randomized low rank projections in Sec- 
tion 9, based on power iteration-based subspace finder tools 
p6). We provide empirical support for our claims through 



experimental results on synthetic and real data in Section 10 



Finally, we conclude with future work directions in Section 

m 



1.2 Contributions 



2 Elementary Definitions and Properties 



In this work, we study a special class of iterative greedy al- 
gorithms known as hard thresholding methods. Similar re- 
sults have been derived for the vector case |24|. Note that 
the transition from sparse vector approximation to ARM is 
non-trivial; while s-sparse signals "live" in the union of fi- 
nite number of subspaces, the set of rank-A; matrices expands 
to infinitely many subspaces. Thus, the selection rules do not 
generalize in a straightforward way. 

Our contributions are the following: 

Ingredients of hard thresholding methods: We ana- 
lyze the behaviour and performance of hard thresholding 
methods from a global perspective. Five building blocks are 
studied: i) step size selection /i^, ii) gradient or least-squares 
updates over restricted low-rank subspaces (e.g., adaptive 



We reserve lower-case and bold lower-case letters for scalar 
and vector variable representation, respectively. Bold upper- 
case letters denote matrices while bold calligraphic upper- 
case letters represent linear operators. We use calligraphic 
upper-case letters for set representations. We use X{i) to 
represent the matrix estimate at the i-th iteration. 

The rank of X is denoted as rank(X) < min{m,7i}. 
The empirical data error is denoted as /(X) := ||y— .4Jf||2 
with gradient V/(X) := -2A*{y - AX), where * is 
the adjoint operation over the linear mapping A. The in- 
ner product between matrices A, B E ]]j™x" js denoted as 
{A, B) = trace(S A), where ^ represents the transpose 
operation. I represents an identity matrix with dimensions 
apparent from the context. 
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Let 5 be a set of orthonormal, rank-1 matrices that span 
an arbitrary subspace in M™^". We reserve span(5) to de- 
note the subspace spanned by S. With slight abuse of nota- 
tion, we use: 



this projection is not always unique. In the case of multiple 
identical singular values, the lexicographic approach is used 



to break ties. In any case, 
for any rank-/c W G M™' 



VkiX)-X\< w-x 



rank(span(iS)) 



max{rank(X) : X E span(5)} 



(2) 



to denote the maximum rank a matrix X E jj'nx" can have 
such that X lies in the subspace spanned by the set S. Given 
a finite set S, \S\ denotes the cardinality of 5. For any matrix 
X, we use R{X) to denote its range. 

We define a minimum cardinality set of orthonormal, 
rank-1 matrices that span the subspace induced by a set of 
rank-1 (and possibly non-orthogonal) matrices S as: 



2.2 Subspace projections 

Given a set of orthonormal, rank-1 matrices S, we denote the 
orthogonal projection operator onto the subspace induced by 
S as T'JHwhich is an idempotent linear transformation; fur- 
thermore, we denote the orthogonal projection operator onto 
the orthogonal subspace of S as 'Ps±. We can always de- 



ortho(5) E argmin{|T| 
r 



T QU s.t. span(T) — span(5)}, 



compose a matrix X E 
as follows: 



into two matnx components, 



X ■.= VsX+Vs^X, such that (7'5X,7'5iX) = 0. 



where U denotes the superset that includes all the sets of or- 
thonormal, rank-1 matrices in M™x" such that {Ti,Tj) = If X G span(5), the best projection of X onto the subspace 



0, i ^ j,\fT,,T, E rand, \\T, 



1, Vi. In general, induced by S is the matrix X itself. Moreover, T'sX „ < 



ortho(5) is not unique. 

A well-known lemma used in the convergence rate proofs 
of this class of greedy hard thresholding algorithms is de- 
fined next. 



Lemma 1 l[27y Let J C K™x" ]je a closed convex set and 
f : J^ ^ M. be a smooth objective function defined over J . 
Let X* E S be a local minimum of the objective function f 
over the set J . Then 



(V/(X*),X-X*)>0, -iXEj. 



(3) 



X „ for any S and X. 

Definition 2 [Orthogonal projections using SVD] Let X E 
jjmxn i^g ^ matrix with arbitrary rank and SVD decompo- 
sition given by (Hi. Then, S := {uivf : i = 1, . . . , fc} 
(k < rank(X)) constitutes a set of orthonormal, rank-I 
matrices that spans the best fc-rank subspace in R{X) and 
R{X ); here, Ui and Vi denote the i-th left and right singu- 
lar vectors, respectively. The orthogonal projection onto this 
subspace is given by Q: 



VsX = VuX + XVv - VuXVv 



(6) 



2.1 Singular Value Decomposition (SVD) and its properties ^^^^^ j,^ ^ V ..^v.kU^A-.k and Vv = V ..^x-.kV^v.k i" Mat- 



Definition 1 [SVD] Let X E M™""" be a rank-/ {I < min 
{m, n}) matrix. Then, the SVD of X is given by: 



LAB notation. Moreover, the orthogonal projection onto the 
S-^ is given by: 



X = usv' 



[UaU^ 



so 





1 ryTi 



vl 



Vs±X = X-VsX. 



(7) 



(4) 



where C/c, E M"^',Lf^ E E'"><(™-'), y„ e ]R">^',y^ E 

^nx{n-i) ^jjj J; ^ diag(cri, . . . , ai) E M'^' for CTi, . . . , 

(Ti E M+. Here, the columns of U,V represent the set of 
left and right singular vectors, respectively, and cri , . . . , ct/ 
denote the singular values. 

For any matrix X E M™^" with arbitrary rank(X) < 
min{m, n}, its best orthogonal projection Vk{X) onto the 
set of rank-A: (fc < rank(X)) matrices Ck := {A E M"^" : 
rank(A) < fc} defines the optimization problem: 



VkiX) E argmin rr-X 



(5) 



According to the Eckart- Young theorem 1 28 1, the best rank- 
fc approximation of a matrix X corresponds to its truncated 
SVD: if X = USV^, then Vk{X) ■- Uk^kVl where 
Sk G M'^'^'^ is a diagonal matrix that contains the first fc 
diagonal entries of S and U^, V k contain the correspond- 
ing left and right singular vectors, respectively. Moreover, 



In the algorithmic descriptions, we use S -^ Vk {X) 
to denote the set of rank-1, orthonormal matrices as outer 
products of the fc left Ui and right Vi principal singular vec- 
tors of X that span the best rank-fc subspace of X; e.g. 
S = {uiVi, i = 1, . . . , fc}. Moreover, X ^ Vk {X) de- 
notes a/the best rank-fc projection matrix of X. In some 
cases, we use {S, X} ^— Vk {X) when we compute both. 
The distiction between these cases is apparent from the con- 
text. 



2.3 Restricted Isometry Property 

Many conditions have been proposed in the literature to es- 
tablish solution uniqueness and recovery stability such as 
null space property [29 J, exact recovery condition [30], etc. 
For the matrix case, P3| proposed the restricted isometry 
property (RIP) for the ARM problem. 

' The distinction between Vs and Vh for k positive integer is appar- 
ent from context. 
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Definition 3 [Rank Restricted Isometry Property (R-RIP) 
for matrix linear operators |13|] A linear operator ^ ; M™^" 
^ RP satisfies the R-RIP with constant Sk{A) G (0, 1) if 
and only if: 



(1-4(^))||X||1< 



\AX\\l<il + Sk{A))\\X\\l, (8) 



VX e K"*^" such that rank(X) < k. We write 4 to mean 
6k{A.), unless otherwise stated. 

Q shows that Pauli operators satisfy the rank-RIP in com- 
pressive settings while, in function learning, the linear map 
A is designed specifically to satisfy the rank-RIP ||7|. 

2.4 Some useful bounds using R-RIP 

In this section, we present some lemmas that are useful in 
our subsequent developments — these lemmas are consequen- 
ces of the R-RIP of A 



Lemma 2 (^ Let A : W""""^ ^ Rp be a linear operator 
that satisfies the R-RIP with constant S^- Then, Wv € RP, 
the following holds true: 

\\Vs{A*v)\\p< ^l + Sk\\v\\^, (9) 

where S is a set of orthonormal, rank-1 matrices in M™^" 
such that rank{VsX) < k, \fX e M™""". 

Lemma 3 /|27j/ Let A : M'"""" ^ Rp be a linear opera- 
tor that satisfies the R-RIP with constant 5],. Then, yX G 
M™^", the following holds true: 

(1 - dk)\\rsX\\^ < \\VsA*AVsX\\^ 

< (1 + 4)11^5^11^, (10) 

where S is a set of orthonormal, rank-1 matrices in K™^" 
such that rank{VsX) < fc, VX e M™><". 

Lemma 4 1^ Let A : M"''" -^ Rp be a linear op- 
erator that satisfies the R-RIP with constant Sk and S be 
a set of orthonormal, rank-1 matrices in M™^" such that 
rank{VsX) < fc, VX e M"^". Then, for fi > 0, A satis- 
fies: 

Xi^irsA*AVs)e [Ai(l-4),A*(l + 4)]. (11) 

where X{3) represents the range of eigenvalues of the linear 
operator B : RP ^ M™><". Moreover, VX G M'"^^", it 
follows that: 

\\{I-^IVsA*AVs)VsX\\p 

< max{fi{l + 4) - 1, 1 - A*(l - Sk)} W'PsXWp. (12) 

Lemma 5 f^ Let A : R"^'^" ^ Rp be a linear operator 
that satisfies the R-RIP with constant Sk and Si , S2 be two 
sets of orthonormal, rank-1 matrices in K™^" such that 

rankiVs.us^X) < fc, VX e M"^". (13) 

Then, the following inequality holds: 

\\Vs,A*Ars^X\\j^ < 6k\\Vs^X\\pyX e spaniS2). 

(14) 



3 Algrebraic Pursuits in a nutsliell 



Explicit descriptions of the proposed algorithms are pro- 
vided in Algorithms 1 and 2. Algorithm 1 follows from the 
ALgrebraic Pursuits (ALPS) scheme for the vector case 1 3 1 1 . 
Matrix ALPS I provides efficient strategies for adaptive 
step size selection and additional signal estimate updates 
at each iteration (these motions are explained in detail in 
the next subsection). Algorithm 2 (ADMiRA) fill further 
improves the performance of Algorithm 1 by introducing 
least squares optimization steps on restricted subspaces — 
this technique borrows from a series of vector reconstruction 
algorithms such as CoSaMP |32|, Subspace Pursuit (SP) 
pB| and Hai-d Thresholding Pursuit (HTP) |34| . 

In a nutshell, both algorithms simply seek to improve 
the subspace selection by iteratively collecting an extended 
subspace Si with rank(span(5i)) < 2fc and then finding the 
rank-fc matrix that fits the measurements in this restricted 
subspace using least squares or gradient descent motions. 

At each iteration, the Algorithms 1 and 2 perform mo- 
tions from the following Ust: 



1) Best rank-k subspace orthogonal to Xi and ac- 
tive subspace expansion: We identify the best rank-fc 
subspace of the current gradient V/(X(i)), orthogonal 
to Xi and then merge this low -rank subspace with Xi. 
This motion guarantees that, at each iteration, we ex- 
pand the current rank-A; subspace estimate with k new, 
rank-1 orthogonal subspaces to explore. 

2a) Error norm reduction via greedy descent with 
adaptive step size selection (Algorithm 1): We decrease 
the data error by performing a single gradient descent 
step. This scheme is based on a one-shot step size se- 
lection procedure (Step size selection step) — detailed 
description of this approach is given in SectionH] 

2b) Error norm reduction via least squares opti- 
mization (Algorithm 2): We decrease the data error 
/(X) on the active 0(fc)-low rank subspace. Assum- 
ing A is well-conditioned over low -rank subspaces, the 
main complexity of this operation is dominated by the 
solution of a symmetric linear system of equations. 

3) Best rank-k subspace selection: We project the 
constrained solution onto the set of rank-fc matrices 
Ck ■■= {A e M"^" : rank(A) < fc} to arbitrate 
the active support set. This step is calculated in poly- 
nomial time complexity as a function of m x n us- 
ing SVD or other matrix rank-revealing decomposition 
algorithms — further discussions about this step and its 
approximations can be found in Sections [8] and l9] 

4) De-bias using gradient descent (Algorithm 1): 
We de-bias the current estimate W{i) by performing an 
additional gradient descent step, decreasing the data er- 
ror. The step size selection procedure follows the same 
motions as in 2a). 
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Input: y. A., k. Tolerance r], Maxlterations 
Initialize: X{0) ^ 0, Xq ^ {0}, i ^ 
repeat 

Si^ViU Xi 

ti, ^ argmin^ ||y - A{X{i) - f Pg, V/(X(i))) H^ = 

V(i)^X(i)-fPs,V/(X(i)) 

{W„ W(i)}+-Pft(V(i)) 

?. ^ argmin^ \\v-A{W{t) - |Pw, V/(W(i))) ||' - y^p^^ v/(WW)||^ 

X(i + 1)^ W(i)-%Pw,V/(W(j)) withA-.+ i ^Pfc(X(i + l)) 
i <— i + 1 
until ||X(j) - X{i - 1)||2 < ??||X(i)||2 or Maxlterations. 



||AP5,V/(Jf{i))ll 



llPvv.v/cwCi))!!! 



(Beif rank-k subspace orthogonal to Xi) 

(Active subspace expansion) 

(Step size selection) 

(Error norm reduction via gradient descent) 
(Best rank-k subspace selection) 

(Step size selection) 

(De-bias using gradient descent) 



Algorithm 1: Matrix ALPS I 



Input: y, A., k. Tolerance rj, Maxlterations 
Initialize: X(0) ^ 0, A'o ^ {0}, i ^ 
repeat 

1: V,^Vk{V;^±\'f{X(i))) 

2: Si^ViU Xi 

3: V{i) i- a.igmm^r.,v&p:m{S,) h ^ ^'^Wl 
4: {Xi+i, X{i+l)}^Vk{V{i)) 
i •(— i + 1 
until \\X{i) - X{i - 1)||2 < ??||X(j)||2 or Maxlterations. 



(Best rank-k subspace orthogonal to Xi) 

(Active subspace expansion) 

(Error norm reduction via least-squares optimization) 

(Best rank-k subspace selection) 



Algorithm 2: ADMiRA Instance 



4 Ingredients for hard thresholding methods 

4. 1 Step size selection 

For the sparse vector approximation problem, recent works 
on the performance of the IHT algorithm provide strong 



convergence rate guarantees in terms of RIP constants 1 35 1. 
However, as a prerequisite to achieve these strong isometry 
constant bounds, the step size is set /^^ = l,Vi, given that 
the sensing matrix satisfies ||^||2 < 1 where || • ||2 denotes 
the spectral norm [34J ; similar analysis can be found in \3j 
for the matrix case. From a different perspective, p6) pro- 
poses a constant step size ^i = 1/(1 + 52k): Vi, based on 
a simple but intuitive convergence analysis of the gradient 
descent method. 

Unfortunately, most of the above problem assumptions 
are not naturally met; the authors in |37| provide an intu- 
itive example where IHT algorithm behaves differently un- 
der various scalings of the sensing matrix; similar coun- 
terexamples can be devised for the matrix case. Violating 
these assumptions usually leads to unpredictable signal re- 
covery performance of the class of hard thresholding meth- 
ods. Therefore, more sophisticated step size selection pro- 
cedures should be devised to tackle these issues during ac- 
tual recovery. On the other hand, the computation of R-RIP 
constants has exponential time complexity for the strategy 
oft3J. 

To this end, existing approaches broadly fall into two 
categories: constant and adaptive step size selection. In this 
work, we present efficient strategies to adaptively select the 
step size /i; that implies fast convergence rate, for mild R- 



RIP assumptions on A.. Constant step size strategies easily 
follow from |24| and are not listed in this work. 

Adaptive step size selection. There is limited work on 
the adaptive step size selection for hard thresholding meth- 
ods. To the best of our knowledge, apart from p4[ , (37)- p8) 
are the only studies that attempt this via line searching for 
the vector case. At the time of review process, we become 
aware of |39| which implements ideas presented in p7) for 
the matrix case. 

According to Algorithm 1, let X{i) be the current rank- 
k matrix estimate spanned by the set of orthonormal, rank-1 
matrices in Xi. Using regular gradient descent motions, the 
new rank-fc estimate W{i) can be calculated through: 



Vi = X(^)-^yJ{X{^)), 



{Wi, W{i)}^Vk{V{i)). 



We highlight that the rank-/c approximate matrix may not 
be unique. It then holds that the subspace spanned by Wi 
originates: i) either from the subspace of Xi, ii) or from the 
best subspace (in terms of the Frobenius norm metric) of the 
current gradient V/(X(i)), orthogonal to Xi, in) or from 
the combination of orthonormal, rank-1 matrices lying on 
the union of the above two subspaces. The statements above 
can be summarized in the following expression: 



span(yVi) e span (Vi U X^) 



(15) 



for any step size /i.^ and Vi <— Vk{'Px:^'^ f{X{i))). Since 
rank(span(>Vi)) < k, we easily deduce the following key 
observation: let Si -s— ViUXi be a set of rank-1, orthonormal 
matrices where rank(span(5i)) < 2k. Given Wi is unknown 
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Matrix ALPS I [40.6288] 
SVP - ft = 1 - 145.571] 
SVP - ft = 1/2 - [83.4564] 
SVP - ^1, = 1/4 - [154.5011] 




40 60 80 

# of iterations 

(a) 



^ 


Matrix ALPS I [9.9404] 

SVP-ft =1- [13.3802] 

SVP -ft =1/2- [24.2676] 

SVP - ft = 1/4 - [42.7626] 


^ 


s. 



40 60 80 

# of iterations 



-Matrix ALPS I [6.9586] 

-Matrix ALPS I - no arid, updates - [10.2832] 




10 15 20 

# of iterations 



(b) 



(c) 



Fig. 1 Median error per iteration for various step size policies and 20 Monte-Carlo repetitions. In brackets, we present the median time consumed 
for convergene in seconds, (a) m = n = 2048, p = 0.4n^, and rank k = 70 — A. is formed by permuted and subsampled noiselets (40]. (b) 
n = 2048 , m = 512, p = 0.4?i^, and rank k = 50 — we use underdetermined linear map A. according to the MC problem (c) n = 2048, 
m = 512, p = 0.4n^ , and rank k = 40 — we use underdetermined linear map A according to the MC problem. 



before the i-th iteration. Si spans the smallest subspace that 
contains Wi such that the following equality 



r, (x(») - |v/(x(*)) 

= 7'.(xW-|7'5.V/(X(z)) 

necessarily holdsrl 

To compute step-size fii, we use: 



/i, = argmin \\y - A (x(z) - f^Vs^ f{X{i))) \\l 

iwsyfixmii 



\APsyfixm\ 



(17) 



i.e., fii is the minimizer of the objective function, given the 
current gradient V/(X(i)). Note that: 



l-S2k{A)<-<l + 52kiA), 

IJ-i 



(18) 



due to R-RIP — i.e., we select 2fc subspaces such that /i^ sat- 
isfies ( [T8] l. We can derive similar arguments for the addi- 
tional step size selection ^j in Step 6 of Algorithm 1. 

Adaptive fii scheme results in more restrictive worst- 
case isometry constants compared to |3, 34,4i|, but faster 
convergence and better stability are empirically observed in 
general. In |[3), the authors present the Singular Value Pro- 
jection (SVP) algorithm, an iterative hard thresholding al- 
gorithm for the ARM problem. According to |3 1, both con- 
stant and iteration dependent (but user-defined) step sizes 
are considered. Adaptive strategies presented in |3| require 
the computation of R-RIP constants which has exponential 
time complexity. Figures l(a)-(b) illustrate some character- 
istic examples. The performance varies for different prob- 
lem configurations. For /i > 1, SVP diverges for various 
test cases. We note that, for large fixed matrix dimensions 

^ In the case of multiple identical singulai' values, any ties are lexi- 
cographically dissolved. 



m, n, adaptive step size selection becomes computationally 
expensive compared to constant step size selection strate- 
gies, as the rank of X* increases. 



Qg-. 4.2 Updates on restricted subspaces 

In Algorithm 1, at each iteration, the new estimate W{i) -f- 
Vk iV{i)) can be further refined by applying a single or 
multiple gradient descent updates with line search restricted 
on Wi |34| (Step 7 in Algorithm 1): 

X(^ + 1) ^ W(i) - ^rw,VfiW{{)), 

where C, = ||'!Ip^J'v/CM^'(0)||i ■ ^" ^P""^'' *^ gradient step 
above is the same as block coordinate descent in convex op- 
timization where we find the subspaces adaptively. Figure 
1(c) depicts the acceleration achieved by using additional 
gradient updates over restricted low-rank subspaces for a 
test case. 



4.3 Acceleration via memory-based schemes and low-rank 
matrix approximations 

Memory-based techniques can be used to improve conver- 
gence speed. Furthermore, low-rank matrix approximation 
tools overcome the computational overhead of computing 
the best low-rank projection by inexactly solving ( [5]l. We 
keep the discussion on memory utilization for Section[7Jand 
low-rank matrix approximations for Sections I8]andl9]where 
we present new algorithmic frameworks for low-rank matrix 
recovery. 



4.4 Active low-rank subspace tracking 

Per iteration of Algorithms 1 and 2, we perform projection 
operations VsX and Vs±X where X e M"><", as de- 
scribed by (|6| and (J7]i, respectively. Since S is constituted 
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Fig. 2 Median error per iteration for MATRIX ALPS I and MATRIX ALPS II variants over 10 Monte-Carlo repetitions. In brackets, we present 
the median time consumed for convergene in seconds, (a) n = 2048, m = 512, p = 0.25n^, and rank k = 40. (b) n = 2000, m = 1000, 
p = 0.25n-^, and rank k = 50. (c) n = m = 1000, p = 0.25?i^, and rank k = 50. 



by outer products of left and right singular vectors as in 
Definition l2] VsX (resp. 'Ps±X) projects onto the (resp. 
complement of the) best low-rank subspace in R{X) and 
R{X ). These operations are highly connected with the 
adaptive step size selection and the updates on restricted 
subspaces. Unfortunately, the time-complexity to compute 
VsX is dominated by three matrix-matrix multiplications 
which decelerates the convergence of the proposed schemes 
in high-dimensional settings. To accelerate the convergence 
in many test cases, it turns out that we do not have to use the 
best projection Vs in practice PI Rather, employing inexact 
projections is sufficient to converge to the optimal solution: 
either i) VuX onto the best low-rank subspace in R{X) 
only (if TO <C n) or ii) XVy onto the best low-rank sub- 
space in R{X ) only (if m ^ nr\ Vu and Vv are defined 
in Definition [2] and require only one matrix-matrix multipli- 
cation. 

Figure 2 shows the time overhead due to the exact pro- 
jection application Vs compared to Vu for m < n. In Figure 
2(a), we use subsampled and permuted noiselets for linear 
map A. and in Figures 2(b)-(c), we test the MC problem. 
While in the case m — n the use of (|6])-(|7| has a clear ad- 
vantage over inexact projections using only Vu, the latter 
case converges faster to the desired accuracy 5 • 10^^ when 
771 ^ n as shown in Figures 2(a)-(b). In our derivations, we 
assume Vs and Vs^ as defined in (|6]l and (|7]i. 



5 Convergence guarantees 

In this section, we present the theoretical convergence guar- 
antees of Algorithms 1 and 2 as functions of R-RIP con- 
stants. To characterize the performance of the proposed al- 
gorithms, both in terms of convergence rate and noise re- 
silience, we use the following recursive expression: 



In (19 1, 7 denotes the approximation guarantee and provides 



|X(^ + 1)-X*\\f< p\\X(i) - X*\\p+^\\e\ 



(19) 



' From a different perspective and for a different problem case, sim- 
ilar ideas have been used in 1 18 1. 
^ We can move between these two cases by a simple transpose of 



insights into algorithm's reconstruction capabilities when ad- 
ditive noise is present; p <\ expresses the convergence rate 
towards a region around X*, whose radius is determined by 
£||2. In short, (19 1 characterizes how the distance to the 



i-p 



true signal X* is decreased and how the noise level affects 
the accuracy of the solution, at each iteration. 



5.1 Matrix ALPS I 

An important lemma for our derivations below is given next: 

Lemma 6 [Active subspace expansion] Let X{i)be the ma- 
trix estimate at the i-th iteration and let Xi be a set of or- 
thonormal, rank-1 matrices such that Xi -s— Vk{X{i)). Then, 
at each iteration, the Active Subspace Expansion step in Al- 
gorithms 1 and 2 identifies information in X* , such that: 



\Vx'Vs^X*\\p<{25: 



2fc 



25: 



ik, 



\X{i) -X* 



+ ^2{l + 52k)\\e\\^, 
where 5^ 4— A^ U Vi and X* <— Vk{X*). 



(20) 



Lemma |6] states that, at each iteration, the active sub- 
space expansion step identifies a 2k rank subspace such that 
the amount of unrecovered energy of X* — i.e., the projec- 
tion of X* onto the orthogonal subspace of span(5i) — is 
bounded by (20i. 

Then, Theorem 1 characterizes the iteration invariant of 
Algorithm 1 for the matrix case: 

Theorem 1 [Iteration invariant for MATRIX ALPS \] The 

(i + l)-th matrix estimate X{i + 1) of MATRIX ALPS I 
satisfies the following recursion: 



\\X{i + l)-X*\\^<p\\X{t)-X*\\^+4e\\^, 
where p := {^ (^ + (26,, + 26,,)^) 



the problem. 



1-5. 

Moreover, when S^i^ 
tive. 



(21) 
and 



1 — 152*; 1 — <52) 



v/2(iTM 

< 0.1235, the iterations are contrac- 



l-Sk 
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To provide some intuition behind this result, assume that 
X* is a rank-fc matrix. Then, according to Theorem fll for 
p < 1, the approximation parameter 7 in (21 1 satisfies: 

7 < 5.7624, for Sak < 0.1235. 

Moreover, we derive the following: 



P< 



1 + 25: 



3fe 



(4,5: 



3fe 



sslk) 



< 



1 



Ssk < 0.079, 



which is a stronger R-RIP condition assumption compared 
to state-of-the-art approaches |21|. In the next section, we 
further improve this guarantee using Algorithm 2. 

Unfolding the recursive formula (21 1, we obtain the fol- 
lowing upper bound for 1 1 X (i) — X* jnTat the i-th iteration: 

\\X{i) ~ X*\\^ < p'\\X{0) - x*\\ 



7 



after i 



Iterations. 



T „- 2- (22) 
1 — p" "^ 

Then, given X{0) = 0, MATRIX ALPS I finds a rank-fc 

solutionX gM"''" such that II X-X*|L < ^^P+^We" 

" log(||X-||F/||e 
log(l/p) 

If we ignore steps 5 and 6 in Algorithm 1, we obtain 
another projected gradient descent variant for the affine rank 
minimization problem, for which we obtain the following 
performance guarantees — the proof follows from the proof 
of Theorem [T] 

Corollary 1 /Matrix ALPS I Instance] In Algorithm I, 
we ignore steps 5 and 6 and let {A^+i, X{i + 1)} -S— 
TkiVi). Then, by the same analysis, we observe that the 
following recursion is satisfied: 



\X{i + l)-X*\\^<p\\X{t)-X*\\^ 



7||£||2 
and-1 — ' ^^i±5 



(23) 



forp := {^ + i2S2k+2Ssk)^) andj := (^^ 

a/2(1 + S2k))- Moreover, p <1 when 63k < 0.1594. 



_2S3 

l-S; 



We observe that the absence of the additional estimate 
update over restricted support sets results in less restrictive 
isometry constants compared to Theorem [T] In practice, ad- 
ditional updates result in faster convergence, as shown in 
Figure 1(c). 



5.2 ADMiRA Instance 

In Matrix ALPS I, the gradient descent steps constitute a 
first-order approximation to least-squares minimization prob- 
lems. Replacing Step 4 in Algorithm 1 with the following 
optimization problem: 



Vi^) 



arg mm 

V:Vespan(5i; 



\y-AV 



we obtain ADMiRA (furthermore, we remove the de-bias 
step in Algorithm 1). Assuming that the linear operator A., 
restricted on sufficiently low-rank subspaces, is well condi- 
tioned in terms of the R-RIP assumption, the optimization 
problem ( [24| has a unique optimal minimizer By exploiting 
the optimality condition in Lemma [T] ADMiRA instance in 
Algorithm 2 features the following guarantee: 



Theorem 2 [Iteration invariant for ADMiRA instance] The 
(i + l)-th matrix estimate X{i-\-l) of ADMiRA answers the 
following recursive expression: 



'•- , and 7 



im\F 



v/2(l + 53fc) 



\\X{i + l)-X*\\p<p\\X{t)-X*\\p 
P- (252fe+253fc)y'it^ 

+ ( ^i^_^/^f " +V3\ VI + 62k- Moreover, when 63k < 0.2267, 
the iterations are contractive. 

Similarly to MATRIX ALPS I analysis, the parameter 7 
in Theorem|2]satisfies: 

7 < 5.1848, for^sfc < 0.2267. 

Furthermore, to compare the approximation guarantees of 
Theorem[2]with |21 1, we further observe: 



2 Jafc < 0.1214, forp < 1/2. 

We remind that |21 1 provides convergence guarantees for 
ADMiRA with 64k < 0.04 for p = 1/2. 



6 Complexity Analysis 

In each iteration, computational requirements of the pro- 
posed hard thresholding methods mainly depend on the total 
number of linear mapping operations A^ gradient descent 
steps, least-squares optimizations, projection operations and 
matrix decompositions for low rank approximation. Differ- 
ent algorithmic configurations (e.g. removing steps 6 and 

7 in Algorithm 1) lead to hard thresholding variants with 
less computational complexity per iteration and better R- 
RIP conditions for convergence but a degraded performance 
in terms of stability and convergence speed is observed in 
practice. On the other hand, these additional processing steps 
increase the required time-complexity per iteration; hence, 
low iteration counts are desired to tradeoff these operations. 

A non-exhaustive list of linear map examples includes 
the identity operator (Principal component analysis (PCA) 
problem), FourierAVavelets/Noiselets tranformations and the 
famous Matrix Completion problem where ^ is a mask op- 
erator such that only a fraction of elements in X is ob- 
served. Assuming the most demanding case where A. and 
A.* are dense linear maps with no structure, the compu- 
tation of the gradient W f{X{i)) at each iteration requires 
0{pkmn) arithmetic operations. 

Given a set S of orthonormal, rank-1 matrices, the pro- 
jection VsX for any matrix X E jj™^" requires time com- 
^ ' plexity 0{nia,x{m^n, mn^) as a sequence of matrix-matrix 
multiplication operations |^In MATRIX ALPS I, the adap- 



tive step size selection steps require 0{raax{pkmn, m^n 



}) 



time complexity for the calculation of pi and ^^ quantities. In 

^ While such operation has 0(max{m-^n, mn'^ }) complexity, each 
application of VsX requires three matrix-matrix multiplications. To 
reduce such computational cost, we relax this operation in Section 
where in practice we use only Vu that needs one matrix-matrix mu 
phcation. 
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ADMiRA solving a least-squares system restricted on rank- 
2k and rank-fc subspaces requires 0{pk^) complexity; ac- 
cording to 1 32 1, pT), the complexity of this step can be fur- 



ther reduced using iterative techniques such as the Richard- 
son method or conjugate gradients algorithm. 

Using the Lanczos method, we require 0{kmn) arith- 
metic operations to compute a rank-fc matrix approximation 
for a given constant accuracy; a prohibitive time-complexity 
that does not scale well for many practical applications. Sec- 
tions IS] and l9] describe approximate low rank matrix projec- 
tions and how they affect the convergence guarantees of the 
proposed algorithms. 

Overall, the operation that dominates per iteration re- 
quires 0{m.a,x{pkmn, m^n, mri^}) time complexity in the 
proposed schemes. 



7 Memory-based Acceleration 

Iterative algorithms can use memory to gain momentum in 
convergence. Based on Nesterov's optimal gradient methods 
142), we propose a hard thresholding variant, described in 
Algorithm 3 where an additional update on X{i + I) with 
momentum step size r^ is performed using previous matrix 
estimates. 

Similar to /i^ strategies, r^ can be preset as constant or 
adaptively computed at each iteration. Constant momentum 
step size selection has no additional computational cost but 
convergence rate acceleration is not guaranteed for some 
problem formulations in practice. On the other hand, em- 
pirical evidence has shown that adaptive t^ selection strate- 
gies result in faster convergence compared to zero-memory 
methods with similar complexity. 

For the case of strongly convex objective functions, Nes- 
terov | [43) proposed the following constant momentum step 

, where ao G (0, 1) 



size selection scheme: t, 



and cti+i is computed as the root € (0, 1) of 



(1 



-i)al + qa.,+i, for q = 



k2(^)' 



(25) 



where Hi{A) denotes the condition number of A.. In this 
scheme, exact calculation of q parameter is computationally 
expensive for large-scale data problems and approximation 
schemes are leveraged to compensate this complexity bot- 
tleneck. 

Based upon adaptive jii selection, we propose to select 
Ti as the minimizer of the objective function: 



Ti = argmin \\y — A.Q{i + 1) 



(y - AX{i),AX{i) - AX{i - 1)) 



\AX( 



AX( 



(26) 



where AX {{}, AX{i—l) are aheady pre -computed at each 
iteration. According to (J26]l, r^ is dominated by the calcula- 
tion of a vector inner product, a computationally cheaper 
process than q calculation. 

Theorem l3] characterizes Algorithm 3 for constant mo- 
mentum step size selection. To keep the main ideas simple. 



-Matrix ALPS II - t, = opt. [48.15] 
-Matrix ALPS II - t, = 1/4 [71.6347] 
-Matrix ALPS II - t, = 1/8 [107.7316] 
-Matrix ALPS II - 7 1 'IC. '1111,4127] 
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50 
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^ of iterations 



Fig. 3 Median error per itemtion for various momentum step size poli- 
cies and 10 Monte-Carlo repetitions. Here, n = 1024, m = 256, 
p = 0.25n^, and rank k = 40. We use permuted and subsampled 
noiselets for the linear map A.. In brackets, we present the median time 
for convergence in seconds. 



we ignore the additional gradient updates in Algorithm 3. 
In addition, we only consider the noiseless case for clarity. 
The convergence rate proof for these cases is provided in the 
appendix. 

Theorem 3 [Iteration invariant for MATRIX ALPS 117 Let 
y = Ax* be a noiseless set of observations. To recover 
X* from y and A, the {i + l)-th matrix estimate X(i -\- 1) 
of Matrix ALPS II satisfies the following recursion: 



\X{i + 1) - X*||^ < a(l + n)\\X{i) - X* 
+ q;tJ X(i — 1) — X*\\ 



(27) 



l-Ss 



(2J3fc+2(54fc) ^_|'° ■ Moreover, solving 



where a ;= 

the above second-order recurrence, the following inequality 

holds true: 



\\X{i + I) - X*\\p < p'+^\\X{0) 



-X* 



If' 



(28) 



forp 



a{l+Ti)-i-^a^{l+Ti)^+4:aTi 



Theorem l3] provides convergence rate behaviour proof 
for the case where r^ is constant Vi. The more elaborate case 
where r^ follows the policy described in ( 26 1 is left as an 



open question for future work. To provide some insight for 
(|28]l, for n = 1/4, Vi and n = 1/2, Vi, Sik < 0.1187 



and 64k < 0.095 guarantee convergence in Algorithm 3, re- 
spectively. While the RIP requirements for memory-based 
Matrix ALPS II are more stringent than the schemes pro- 
posed in the previous section, it outperforms Algorithms 1 
and 2. Figure 2 shows the acceleration achieved in MATRIX 
ALPS II by using inexact projections Vu- Using the proper 
projections (|6])-(|7]i. Figure [3] shows acceleration in practice 
when using the adaptive momentum step size strategy: while 
a wide range of constant momentum step sizes leads to con- 
vergence, providing flexibility to select an appropriate Ti, 
adaptive t,; avoids this arbitrary t^ selection while further 
decreases the number of iterations needed for convergence 
in most cases. 
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Input: y. A., k. Tolerance r], Maxlterations 
Initialize: X(0) ^ 0, Xq ^ {0}, Q(0) ^ 0, S 
repeat 

p,, ^ argmin^ \\y - A{Q{i) - f^s. V/(Q(i))) \\l = 
V(i)^Q(i)- fPs.V/CQW) 

Q{i + 1) ^ X{i + 1) + r,(X(i + 1) - X(j)) 
Si+i ^ortho(A'iUA'i+i) 
i •(— i + 1 
until ||X(j) - X{i - 1)||2 < ri\\X{i)\\2 or Maxlterations 



{0}, Ti VJ, i ^ 



I|-P5,v/(Q('))IIf 
ll-AT^s.-V/CQCi))!!! 



fBei? rank-k subspace orthogonal to Q J 

(Acfjve subspace expansion) 

(Step size selection) 

(Error norm reduction via gradient descent) 

(Best rank-k subspace selection) 

(Momentum update) 



Algorithm 3: Matrix ALPS II 



8 Accelerating Matrix ALPS: e-Approximation of 
SVD via Column Subset Selection 

A time-complexity bottleneck in the proposed schemes is 
the computation of the singular value decomposition to find 
subspaces that describe the unexplored information in ma- 
trix X* . Unfortunately, the computational cost of regular 
SVD for best subspace tracking is prohibitive for many ap- 
plications. 

Based on |44 45|, we can obtain randomized SVD ap- 
proximations of a matrix X using column subset selection 
ideas: we compute a leverage score for each column that 
represents its "significance". In particular, we define a prob- 
ability distribution that weights each column depending on 
the amount of information they contain; usually, the distri- 
bution is related to the ^2 -norm of the columns. The main 
idea of this approach is to compute a surrogate rank-fc ma- 
trix VKX) by subsampling the columns according to this 
distribution. It turns out that the total number of sampled 
columns is a function of the parameter e. Moreover, 1 46 47) 
proved that, given a target rank k and an approximation pa- 
rameter e, we can compute an e-approximate rank-fc matrix 
Vl{X) according to the following defintion. 

Definition 4 [e-approximate low-rank projection] Let X be 
an arbitrary matrix. Then, V^X) projection provides a rank- 
fc matrix approximation to X such that: 



riiX)~X\\'<il + e)\\Vk{X)-X 



|2 

If 



(29) 



where VkiX) £ argmin-y^.j. 



■ank(y)</£ 



\X -Y\ 



For the following theoretical results, we assume the fol- 
lowing condition on the sensing operator A. : ||^*/3||„ < 
A, V/3 G WP where A > 0. Using e-approximation schemes 
to perform the Active subspace selection step, the following 
upper bound holds. The proof is provided in the Appendix: 

Lemma 7 [e-approximate active subspace expansion] Let 
X(i) be the matrix estimate at the i-th iteration and let Xi 
be a set of orthonormal, rank-1 matrices in K™^"- such that 
Xi <— Vk{X{i)). Furthermore, let 

Vl^Vl{V;,.^f{X(z))), 



be a set of orthonormal, rank-1 matrices that span rank-k 
subspace such that {29 \ is satisfied for X :— V^^V f{X{i)). 
Then, at each iteration, the Active Subspace Expansion step 
in Algorithms 1 and 2 captures information contained in the 
true matrix X* , such that: 



<(252fe + 2(53fc)||X(z)-X*|[ 
+ 2AVe, 

where Si ^ Xi\J Vf and X* ^ Vk{X*). 

Furthermore, to prove the following theorems, we ex- 



v/2(l + (52fc)||e||2 
(30) 



tend Lemma 10 provided in the Appendix, as follows. The 



proof easily follows from the proof of Lemma 10 using Def- 
inition H) 

Lemma 8 [e-approximation rank-k subspace selection] Let 
V{i) be a rank-2k proxy matrix in the subspace spanned by 
Si and let W{i) <— 'P'fc(^(*)) denote the rank-k e-approxi- 
mation to V{i), according to pi). Then: 

\\W{i)-V{i)f^<{l + e)\\W{i)-V{i)\\p 

<{l + e)\\VsAV{i)-X*)\\p 

<(l + e)||y(z)-X*||^ (31) 

where W{i) ^Vk{V{i)). 



8.1 Matrix ALPS I using e-approximate low-rank 
projection via column subset selection 

Using e-approximate SVD in MATRIX ALPS I, the follow- 
ing iteration invariant theorem holds: 

Theorem 4 [Iteration invariant with e-approximate projec- 
tions for Matrix ALPS 17 The {i + \)-th matrix estimate 
X{i + 1) of Matrix ALPS I with e-approximate projec- 
tions VI ^ Vl{Vx±Vf{X{i))) and W{i) ^ 7'^(V(z)) 
in Algorithm 1 satisfies the following recursion: 



\X{i + 1) - XlL < p\\X{i) - x*\L+^\\eh 



/3A, 
(32) 
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where p : 
/3:=fl 



1 



2A"2fc 



(2 + e)(l + j^)2^e,and 






,\/T+Si 
' l-<52fc 



10" 



10 - 



^10' 

k 



10" 



10 



7^= 1 + T^ (2 + ^) (l + T%7)v/2(lTM + 







Matrix ALPS II - e = 
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Fig. 4 Performance comparison using e-approximation SVD \47] in 
Matrix ALPS II. m = n = 256, p = OAn?, rank of X* equals 2 
and A. constituted by permuted noiselets. The non-smoothness in the 
error curves is due to the extreme low rankness ofX* for this setting. 



e-approximate low-rank projections, as expected. Neverthe- 
less, we observe that, on average, the column subset selec- 
tion process BTl is computationally prohibitive compared 
to regular SVD due to the time overhead in the column se- 
lection procedure — fewer passes over the data are desirable 
in practice to tradeoff the increased number of iterations 
for convergence. In the next section, we present alternatives 
based on recent trends in randomized matrix decompositions 
and how we can use them in low-rank recovery. 



9 Accelerating Matrix ALPS: SVD Approximation 
using Randomized Matrix Decompositions 

Finding low-cost SVD approximations to tackle the above 
complexity issues is a challenging task. Recent works on 
probabilistic methods for matrix approximation |26| provide 
a family of efficient approximate projections on the set of 
rank-deficient matrices with clear computational advantages 
over regular SVD computation in practice and attractive the- 
oretical guarantees. In this work, we build on the low-cost, 
power-iteration subspace tracking scheme, described in Al- 
gorithms 4.3 and 4.4 in p6| . Our proposed algorithm is de- 
scribed in Algorithm 4. 

The convergence guarantees of Algorithm 4 follow the 
same motions described in Section |8] where e is a function 
of 771, 71, k and q. 



Similar analysis can be conducted for the ADMiRA al- 
gorithm. To illustrate the impact of SVD e-approximation on 
the signal reconstruction performance of the proposed meth- 
ods, we replace the best rank-A; projections in steps 1 and 
5 of Algorithm 1 by the e-approximation SVD algorithm, 
presented in |47]. In this paper, the column subset selection 
algorithm satisfies the following theorem: 

Tlieorem 5 Let X E jj™^" fg a signal of interest with ar- 
bitrary rank < min{rn, n} and let Xk represent the best 
rank-k approximation of X. After 2(fc + l)(log(/c + 1) + 1) 
passes over the data, the Linear Time Low-Rank Matrix Ap- 



proximation algorithm in [471 computes a rank-k approxi- 
mation VKX) e i^mx" such that Definitionmis satisfied 
with probability at least 3/4. 

The proof is provided in pT). In total. Linear Time Low- 
Rank Matrix Approximation algorithm pTJ requires 0{mn 
(fc/e + fc2 log k) + (m + n)(fc2/e2 + fc2 log fc/e +fc'' log^ k)) 
and 0{mm{m,n}{k/e + k^logk)) time and space com- 
plexity, respectively. However, while column subset selec- 
tion methods such as 1 ,47] reduce the overall complexity of 
low-rank projections in theory, in practice this applies only 
in very high-dimensional settings. To strengthen this argu- 
ment, in Figure |4] we compare SVD-based MATRIX ALPS 
II with Matrix ALPS II using the e-approximate column 
subset selection method in pTj. We observe that the to- 
tal number of iterations for convergence increases due to 



10 Experiments 

10. 1 List of algorithms 

In the following experiments, we compare the following al- 
gorithms: (j) the Singular Value Projection (SVP) algorithm 
p], a non-convex first-order projected gradient descent al- 
gorithm with constant step size selection (we study the case 
where p = 1), (m) the inexact ALM algorithm |18J based on 
augmented Langrance multiplier method, {Hi) the OptSpace 
algorithm |48|, a gradient descent algorithm on the Grass- 
mann manifold, (iv) the Grassmannian Rank-One Update 
Subspace Estimation (GROUSE) and the Grassmannian Ro- 
bust Adaptive Subspace Tracking methods (GRASTA) 1 49 
[50| , two stochastic gradient descent algorithms that operate 
on the Grassmannian — moreover, to allay the impact of out- 
liers in the subspace selection step, GRASTA incorporates 
the augmented Lagrangian of ^i-norm loss function into the 
Grassmannian optimization framework, (v) the Riemannian 
Trust Region Matrix Completion algorithm (RTRMC) |[5l|, 
a matrix completion method using first- and second-order 
Riemannian trust-region approaches, (vi) the Low rank Ma- 
trix Fitting algorithm (LMatFit) p2) , a nonlinear succes- 
sive over-relaxation algorithm and (mi) the algorithms Ma- 
trix ALPS I, ADMiRA |21 1, MATRIX ALPS II and Ran- 
domized Matrix ALPS II with QR Factorization (referred 
shortly as MATRIX ALPS II with QR) presented in this pa- 
per. 
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Input: y. A., k, q, Tolerance ri, Maxlterations 
Initialize: X(0) ^ 0, Xq ^ {0}, Q(0) ^ 0, i 
repeat 



{0}, Ti VJ, i ^ 



Vi ^ RANDOMIZEDPOWERlTERATION(PgiV/(Q(i)), k, 



^, ^ argmin^ \\y - A{Q{i) - f^s. V/(Q(i))) \\l 






^Vs,^f{Q{i)) 



Vii) ^ Q{i) 2 

W ^ RANDOMIZEDPOWERlTERATION(V(i), k, q) 

X(i + l)^PvvV(J) 

Q{i + 1) ^ X{i + 1) + r,(X(i + 1) - X(j)) 
Qi+i ^ortho{XiUXi+i) 
i <— i + 1 
until ||X(j) - X{i - 1)||2 < r;||X(i)||2 or Maxlterations. 



(Rank-k subspace via Randomized Power Iteration) 

(Active subspace expansion) 

(Step size selection) 

(Error norm reduction via gradient descent) 

(Rank-k subspace via Randomized Power Iteration) 

(Best rank-k subspace selection) 

(Momentum update) 



Algorithm 4: Randomized Matrix ALPS II with QR Factorization 



— 10"' 



-Miitrix ALPS II [39.1954] 

-Matrix ALPS II - SVD ortho [43.6059] 

-Matrix ALPS II - QR ortho [53.4819] 




# u£ iterations 

(a) 



10"' 


V Matrix ALPS II [5.1442] 

\ Matrix ALPS II - SVD ortho [6.002] 
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w 
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*. 
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10"' 


\ 


Matrix ALPS II - SVD 
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\\ 


- 
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Fig. 5 Median error per iteration for MATRIX ALPS II variants over W Monte-Carlo repetitions. In brackets, we present the mean time consumed 
for convergene in seconds, (a) n = 1024, m = 256, p = 0.25n^ , and rank k = 20. (b) n = 2048, m = 512, p = 0.25n^ , and rank k = 60. (c) 
n= 1000, m = 500,p = 0.25n2, and rant fe = 50. 



10.2 Implementation details 

To properly compare the algorithms in the above list, we pre- 
set a set of parameters that are common. We denote the ratio 
between the number of observed samples and the number of 
variables in X* as SR :— p/{m-n) (sampling ratio). Furthe- 
more, we reserve FR to represent the degree of freedom in 
a rank-A: matrix to the number of observations — this corre- 
sponds to the following definition FR := {k{rn + n — k))/p. 
In most of the experiments, we fix the number of observable 
data p = O.imn and vary the dimensions and the rank k of 
the matrix X*. This way, we create a wide range of different 
problem configurations with variable FR. 

Most of the algorithms in comparison as well as the pro- 
posed schemes are implemented in Matlab. We note that 
the LMaFit software package contains parts implemented in 
C that reduce the per iteration computational time. This pro- 
vides insights for further time savings in our schemes; we 
leave a fully optimized implementation of our algorithms as 
future work. In this paper, we mostly test cases where m ^ 
n. Such settings can be easily found in real-world problems 
such as recommender systems (e.g. Netflix, Amazon, etc.) 
where the number of products, movies, etc. is much greater 
than the number of active users. 

In all algorithms, we fix the maximum number of itera- 
tions to 500, unless otherwise stated. To solve a least squares 



problem over a restricted low-rank subspace, we use con- 
jugate gradients with maximum number of iterations given 
by cg_maxiter :— 500 and tolerance parameter cg_tol := 
10"^". We use the same stopping criteria for the majority of 
algorithms under consideration: 



\X{t)~X{t-l)\\^ 



< tol, 



(33) 



where X{i), X{i — 1) denote the current and the previous 
estimate of X* and tol := 5 -10"^. If this is not the case, we 
tweak the algorithms to minimize the total execution time 
and achieve similar reconstruction performance as the rest 
of the algorithms. For SVD calculations, we use the lansvd 
implementation in PROPACK package |53| — moreover, all 
the algorithms in comparison use the same linear operators 
A. and A.* for gradient and SVD calculations and conjugate- 
gradient least-squares minimizations. For fairness, we mod- 
ified all the algorithms so that they exploit the true rank. 
Small deviations from the true rank result in relatively small 
degradation in terms of the reconstruction performance. In 
case the rank of X* is unknown, one has to predict the di- 
mension of the principal singular space. The authors in pi, 
based on ideas in |48|, propose to compute singular values 
incrementally until a significant gap between singular val- 
ues is found. Similar strategies can be found in |18j for the 
convex case. 
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In Matrix ALPS II and Matrix ALPS II with QR, 
we perform Qi <— ortho(A'j U Xi^i) to construct a set of or- 
thonormal rank-1 matrices that span the subspace, spanned 
by Xi U Afi+i. While such operation can be implemented 
using factorization procedures (such as SVD or QR decom- 
positions), in practice this degrades the time complexity of 
the algorithm substantially as the rank k and the problem 
dimensionality increase. In our implementations, we simply 
union the set of orthonormal rank-1 matrices, without fur- 
ther orthogonalization. Thus, we employ inexact projections 
for computational efficiency which results in faster conver- 
gence. Figure 5 shows the time overhead due to the addi- 
tional orthogonalization process. We compare three algo- 
rithms: Matrix ALPS II (no orthogonalization step). Ma- 
trix ALPS II using SVD for orthogonalization and, MA- 
TRIX ALPS II using QR for orthogonalization. In Figures 
5(a)-(b), we use subsampled and permuted noiselets for lin- 
ear map A. and in Figure 5(c), we test the MC problem. In 
all the experimental cases considered in this work, we ob- 
served identical performace in terms of reconstruction accu- 
racy for the three variants, as can be also seen in Figure 5. 
To this end, for the rest of the paper, we use MATRIX ALPS 
II where Qi -;— A",; U A'i+i. 



10.3 Limitations of 



-based algorithms: a toy example 



While nucluear norm heuristic is widely used in solving the 
low-rank minimization problem, f54P| presents simple prob- 
lem cases where convex, nuclear norm-based, algorithms 
fail in practice. Using the 11 • 11 -norm in the objective func- 
tion as the convex surrogate of the rank() metric might lead 
to a candidate set with multiple solutions, introducing am- 
biguity in the selection process. Borrowing the example in 
154], we test the list of algorithms above on a toy problem 
setting that does not satisfy the rank-RIP To this end, we 
design the following problem: let X* e M^^"* be the matrix 
of interest with rank(X*) — 2, as shown in Figurel6|a). We 
consider the case where we have access to X* only through 
a subset of its entries, as shown in Figure|6|b). 



/2 2 1 1\ 

'2211 

2 2 11 

2 2 11 

\1 1 2 1/ 

(a) 



/2 2 1 1 



1 1 



2 2 

? ? ? 1 

2 ? ? 1 

\1 1 2 1/ 

(b) 



Fig. 6 Matrix Completion toy example for X* G R^*^^. We use '?' to 
denote the unobserved entried. 



In Figure 17] we present the reconstruction performance 
of various matrix completion solvers after 300 iterations. Al- 
though there are multiple solutions that induce the recovered 
matrix and have the same rank as X*, most of the algo- 
rithms in comparison reconstruct X* successfully. We note 
that, in some cases, the inadequancy of an algorithm to re- 



construct X* is not because of the (relaxed) problem formu- 
lation but due to its fast — ^but inaccurate — implementation 
(fast convergence versus reconstruction accuracy tradeoff). 



10.4 Synthetic data 

General afflne rank minimization using noiselets: In this 
experiment, the set of observations y eMP satisfy: 



y = AX* 



(34) 



Here, we use permuted and subsampled noiselets for the 
linear operator A. |12|. The signal X* is generated as the 
multiplication of two low-rank matrices, L G jjmxfc ^jjjj 

ReM"><^ such that X* = LR^ and ||X*|L = LBothL 

M W Jr' 

and R have random independent and identically distributed 
(iid) Gaussian entries with zero mean and unit variance. In 
the noisy case, the additive noise term e G E^ contains 
entries drawn from a zero mean Gaussian distribution with 

||£||2 £{10-3,10-''}. 

We compare the following algorithms: SVP, ADMiRA, 
Matrix ALPS I, Matrix ALPS II and Matrix ALPS 
II with QR for various problem configurations, as depicted 
in Table [T] (there is no available code with arbitrary sens- 
ing operators for the rest algorithms). In Table [T] we show 
the median values of reconstruction error, number of iter- 
ations and execution time over 50 Monte Carlo iterations. 
For all cases, we assume SR = 0.3 and we set the maximum 
number of iterations to 500. Bold font denotes the fastest 
execution time. Furthermore, Figure |8] illustrates the effec- 
tiveness of the algorithms for some representative problem 
configurations. 

In Table [T] Matrix ALPS II and Matrix ALPS II 
with QR obtain accurate low -rank solutions much faster than 
the rest of the algorithms in comparison. In high dimen- 
sional settings, MATRIX ALPS II with QR scales better as 
the problem dimensions increase, leading to faster conver- 
gence. Moreover, its execution time is at least a few orders 
of magnitude smaller compared to SVP, ADMiRA and MA- 
TRIX ALPS I implementations. 

Robust matrix completion: We design matrix comple- 
tion problems in the following way. The signal of interest 
X* G jjmxn jg synthesized as a rank-fc matrix, factorized 
as X* := LR^ with ||X*|| ^ = 1 where L e W"'"' and 



R G 



ixfc 



as defined above. In sequence, we subsample 



X* by observing p — 0.3mn entries, drawn uniformly at 
random. We denote the set of ordered pairs that represent 
the coordinates of the observable entries as il = {(«, j) : 
[X*]jj is known} C {1, . . . , to} x {1, . . . , n} and let Aq 
denote the linear operator (mask) that samples a matrix ac- 
cording to n. Then, the set of observations satisfies: 



y = AnX* 



(35) 



i.e., the known entries of X* are structured as a vector y G 
W, disturbed by a dense noise vector e G M^ with fixed- 
energy, which is populated by iid zero-mean Gaussians. 

To demonstrate the reconstruction accuracy and the con- 
vergence speeds, we generate various problem configura- 



tions (both noisy and noiseless settings), according to ( 35 1. 
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(b) FPC 
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(g) ADMiRA 
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Fig. 7 Toy example reconstruction performance for various algorithms. We observe that X* is an integer matrix 
consideration return real matrices as solutions, we round the solution elementwise. 



(e) OptSpace 
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(j) MATRIX ALPS II 

■since the algorithms under 



Table 1 General ARM using Noiselets. 
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MATRIX ALPS II 
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Theenergyofthe additive noise takes values ||£|L G {10 ^, 
10^"'}. All the algorithms are tested for the same signal- 
matrix-noise realizations. A summary of the results can be 
found in Tables l2j [3] and, |4] where we present the median 
values of reconstruction error, number of iterations and exe- 
cution time over 50 Monte Carlo iterations. For all cases, we 
assume SR = 0.3 and set the maximum number of iterations 
to 700. Bold font denotes the fastest execution time. Some 
convergence error curves for specific cases are illustrated in 
Figures |9] and [To] 

In Table |2] LMaFit |52| implementation has the fastest 
convergence for small scale problem configuration where 
m — 300 and n — 600. We note that part of LMaFit im- 



plementation uses C code for acceleration. GROUSE p9) 
is a competitive low-rank recovery method with attractive 
execution times for the extreme low rank problem settings 
due to stochastic gradient descent techniques. Nevertheless, 
its execution time performance degrades significantly as we 
increase the rank of X* . Moreover, we observe how ran- 
domized low rank projections accelerate the convergence 
speed where MATRIX ALPS II with QR converges faster 
than Matrix ALPS II. In Tables [3] and |4] we increase the 
problem dimensions. Here, MATRIX ALPS II with QR has 
faster convergence for most of the cases and scales well as 
the problem size increases. We note that we do not exploit 
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Fig. 8 Low rank signal reconstruction using noiselet linear operator. The error curves are the median values across 50 Monte-Carlo realizations 



over each iteration. For all cases, we assume p = 0.3mn. (a) m = 256, n = 512, fc = 10 and e 
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Fig. 9 Low rank matrix recovery for the matrix completion problem. The error curves are the median values across 50 Monte-Carlo realizations 
over each iteration. For all cases, we assume p = O.Smn. (a) m = 300, n = 600, fc = 5 and llelL = 0. (b) m = 300, n = 600, fc = 20 and 
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Table 2 Matrix Completion prob 
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stochastic gradient descent techniques in the recovery pro- 
cess to accelerate convergence ■which is left for future work. 

10.5 Real data 

We use real data images to highlight the reconstruction per- 
formance of the proposed schemes. To this end, ■we perform 
grayscale image denoising from an incomplete set of ob- 
served pixels — similar experiments can be found in p2| . 
Based on the matrix completion setting, ■we observe a lim- 
ited number of pixels from the original image and perform 
a lo^w rank approximation based only on the set of measure- 
ments. While the true underlying image might not be lo^w- 
rank, we apply our solvers to obtain low-rank approxima- 
tions. 

Figures 11 and 12 depict the reconstruction results. In 



shown in the top left corner of Figure 1 1 For this case, we 
observe only the 35% of the total number of pixels, ran- 
domly selected — a realization is depicted in the top right 
plot in Figure 11 In sequel, we fix the desired rank to fc = 
40. The best rank-40 approximation using SVD is shown 



in the top middle of Figure 1 1 where the full set of pixels 
is observed. Given a fixed common tolerance and the same 
stopping criteria, Figure[TT]shows the recovery performance 
achieved by a range of algorithms under consideration for 
10 Monte-Carlo realizations. We repeat the same experiment 



for the second image in Figure 12 Here, the size of the im- 
age is 256 X 256, the desired rank is set to A; = 30 and 
we observe the 33% of the image pixels. In constrast to the 
image denoising procedure above, we measure the recon- 
struction error of the computed solutions with respect to the 
best rank-3Q approximation of the true image. In both cases. 



the first test case, we use a 512 x 512 grayscale image as 
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Table 3 


Matrix Completion proble 
Configuration 


m for m 
FR 


= 700 and n = 1000. 
SVP 


"— " depicts no information or not applicable due to time overhead 
Inexact ALM | GROUSE 




m 


n 


k 


II^L 




iter. 


err 


time 


iter. 


err. 


time 


iter. 


err. 


time 


700 


1000 


5 





0.04 


34 


1.9 


10-* 


1.77 


23 


6.5-10-" 


1.69 


- 


3.5-10-" 


0.23 


700 


1000 


5 


10-3 


0.04 


34 


4.2 


10-* 


1.92 


23 


3.7 -10-* 


1.87 


- 


3.1 -10-* 


0.24 


700 


1000 


30 





0.239 


61 


4.6 


10-* 


6.39 


29 


1.2- 10-* 


3.91 


- 


3.2 -10-" 


3.15 


700 


1000 


30 


10-^ 


0.239 


61 


1.1 


lo--^' 


6.33 


29 


1- lo--^' 


3.87 


- 


8 -10-* 


3.14 


700 


1000 


50 





0.393 


95 


8.5 


10-* 


14.47 


49 


3.2- 10-* 


9.02 


- 


1.3 -10-" 


10.31 


700 


1000 


50 


10-^ 


0.393 


95 


1.6 


lO-'^' 


15.15 


49 


1.4- lO--^* 


9.11 


- 


8 -10-* 


10.34 


700 


1000 


110 





0.833 
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1.2 


10-^ 


253.1 


374 


5.8- lO--^* 


152.61 


- 


1.2 -10-' 


110.93 


700 


1000 
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lO-'^ 


0.833 


682 


1.3 


10-^ 


256.21 


374 


6.8- lO--^* 


154.34 


- 


1.05- 10-' 


111.05 






LMaFit 


Matrix ALPS II 


MATRIX ALPS II with QR 


ra 


n 


k 






iter. 


err 


time 


iter. 


err. 


time 


iter. 


err. 


time 


700 


1000 


5 





0.04 


24 


7.2 • 10-" 


0.67 


8 


1.5 -lO-" 


1.15 


15 


8.3 -10-" 


1.05 


700 


1000 


5 


10-^ 


0.04 


17 


3.7-10-4 


0.5 


10 


4.5- 10-* 


1.38 


15 


3.8 -10-* 


1.1 


700 


1000 


30 





0.239 


34 


9.2-10-'' 


1.95 


14 


4.5- IQ-" 


3.69 


35 


1.1 -10-* 


2.6 


700 


1000 


30 


10-^ 


0.239 


30 


I-IO-'^ 


1.71 


25 


1.1- 10--=' 


6.1 


35 


1-10-'=' 


2.61 


700 


1000 


50 





0.393 


53 


2.7-10-=' 


4.59 


25 


8.6 -lO-" 


8.87 


57 


1.6 -10-" 


4.47 


700 


1000 


50 


10-^ 


0.393 


52 


1.4-10-^ 


4.53 


40 


1.6- 10-^ 


14.38 


57 


1.4-10-^ 


4.49 


700 


1000 
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9-10-'' 


101.95 
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8- 10-* 


214.93 
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7 -10-* 


51.72 
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10-a 


0.833 
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3.7-10-'=' 
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4.7- 10--=' 


261.98 
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51.62 



Table 4 Matrix Completion problem for m = 500 and n = 2000. "— " depicts no information or not applicable due to time overhead. 
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m 
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err. 
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err. 
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32 
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- 
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32 
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64 
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32 
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6.51 


- 


1.6 -10-* 
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0.408 
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1.1 


10-* 


15.74 


54 


5 -10-* 
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- 


8 -10-" 
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2000 50 


10-'=' 


0.408 
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1.8 


lo--^* 


24.97 


54 


1.55- lO--^* 


16.14 


- 


9 -10-* 


8.6 


500 


2000 50 


10-* 
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1.1 


lo--^* 


24.85 


54 


5 -10-* 


16.17 


- 


7 -10-" 


8.59 
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239 
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10-^ 
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- 
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79.64 
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- 
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3.6 
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60.76 


- 
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173.14 


- 


4.5-10-^ 
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2000 100 


10-^ 


0.8 
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10--^ 
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7-10-^ 


176.04 


- 


5.2-10-^ 
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10-* 


0.8 
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1.1 


10--^ 
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170.47 


- 
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LMaFit 


MATRIX ALPS II 


MATRIX ALPS II with QR 


m 


n k 


II^L 




iter. 


err. 


time 


iter. 


err. 


time 


iter. 


err. 


time 
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13 
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37 
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0.083 


37 
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22 
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37 
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35 
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13 
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37 
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0.408 


60 
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22 
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60 
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60 
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36 
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59 
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10-* 
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60 


2- 10-* 


7.29 


22 


2 -10-* 


11.87 


59 


2 -10-* 


6.75 
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0.645 
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3 -10-* 


33.65 


61 


2 -10-* 


49.53 
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3 -10-* 


18.66 


500 


2000 80 


10-'=' 


0.645 
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2.3- ID--^* 


33.48 


92 


2.4- lO--^* 


75.51 
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2.3 -10--=^ 


18.87 


500 
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10-* 


0.645 
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3- 10-* 


33.47 


61 


4 -10-* 


49.52 


151 


3 -10-* 
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0.8 
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1.5 -10--=' 


115.11 
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4 -10-* 


153.74 
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55.1 


500 


2000 100 


10-^ 


0.8 
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0.8 
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154.46 


428 
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■we note that MATRIX ALPS II has a better phase transition 
performance as compared to the rest of the algorithms. 



11 Discussion 

In this paper, -we present ne-w strategies and revie-w existing 
ones for hard thresholding methods to recover low-rank ma- 



trices from dimensionality reducing, linear projections. Our 
discussion revolves around four basic building blocks that 
exploit the problem structure to reduce computational com- 
plexity -without sacrificing stability. 

In theory, constant jii selection schemes are accompa- 
nied -with strong RIP constant conditions but empirical evi- 
dence reveal signal reconstruction vulnerabilities. While con- 
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Fig. 10 Low rank matrix recovery for the matrix completion problem. The error curves are the median values across 50 Monte-Carlo realizations 
over each iteration. For all cases, we assume p = O.Smn. (a)m = 700, n = 1000, k = 30 and ||e|| = 0. (b) m = 700, n = 1000, fc = 50 and 
\\s\\^ = 10-3. (c)m = 700, n = 1000, k = 110 and IJeH^ = 0. (d) m = 500, n = 2000, fc = lOandHeH^ = 0. (e) m = 500, n = 2000, 
k = 50and||e||2 = lO^^. (f) m = 500, n = 2000, fc = 80and||e||2 = lO"". 



vergence derivations of adaptive schemes are characterized 
by weaker bounds, the performance gained by this choice 
in terms of convergence rate, is quite significant. Memory- 
based methods lead to convergence speed with (almost) no 
extra cost on the complexity of hard thresholding methods — 
we provide theoretical evidence for convergence for simple 
cases but more theoretical justification is needed to general- 
ize this part as future work. Lastly, further estimate refine- 
ment over low rank subspaces using gradient update steps or 
pseudoinversion optimization techniques provides signal re- 
construction efficacy, but more computational power is needed 
per iteration. 

We connect e-approximation low-rank revealing schemes 
with first-order gradient descent algorithms to solve gen- 
eral affine rank minimization problems; to the best of our 
knowledge, this is the first attempt to theoretically charac- 
terize the performance of iterative greedy algorithms with 
e-approximation schemes. In all cases, experimental results 
illustrate the effectiveness of the proposed schemes on dif- 
ferent problem configurations. 
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A Appendix 



'"^" with SVD: X = USV^, and Y G 



Remark 1 Lei X e 

romxn ,^yith SVD: Y = USV^ . Assume two sets: i) Si 



{uiuf : 
i S Xi } where Ui is the i-th singular vector of X and Xi C { 1 , . . . , 
rank(X)} and, ii) S2 = {uiuf ,UjUj'^ : i g I2, j £ 2^3} where 
Ui is the j-th singular vector of Y , Ii C I2 C {1, . . . ,rank(X)} 
and, I3 C {1, . . . , rank(y )}. We observe that the subspaces defined 
by Uiuf and tljUj"^ are not necessarily orthogonal. 

To this end, let S2 = ortho(52); this operation can be easily com- 
puted via SVD. Then, the following commutativity property holds true 
for any matrix W e K™^": 



VsiVgW = VgVsrW. 



A.l Proof of Lemma |6] 



(36) 
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Given X* <— VkiX") using SVD factorization, we define the follow- 
ing quantities: Si <— Xi U Vi, S* -f— ortho (A'i U X*). Then, given 
the structure of the sets Si and S* 



^5i^{S*)^ =^-D>^(A"UA-i)^' 



(37) 
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Fig. 11 Reconstruction performance in image denoising settings. The image size is 512 x 512 and the desired rank is preset to fc = 40. We 
observe 35% of the pixels of the true image. We depict the median reconstruction error with respect to the true image in dB over 10 Monte Carlo 
realizations. 



and 



- 'Px'VcDiUXi)^ 



(38) 



Since the subspace defined in Vi is the best rank-fc subspace, orthogo- 
nal to the subspace spanned by Xi, the following holds true: 



p5.V/(X(i))||^ > ||p5-V/(X(i)) 



Removing the common subspaces in Si and S* by the commutativity 
property of the projection operation and using the shortcut Vj^yg = 
VaPi31_ for sets A, B, we get: 



1^ > ||7'sn5.w(x«)||^ 



\\Vs,\st^f{X{i) 



(39) 



Next, we assume that V, 



(A\B)- 



denotes the orthogonal projection onto 



the subspace spanned by VaPb^ ■ Then, on the left hand side of d39j, 
we have: 



\\Vs^\siA.''A{X* - X(j)) + Ps,\s*-^*£|| 



II' Si\S*-^ '■Wp 



< ||7's.\5->t*^(X•-X(^))||^■ 
=' ||^5A5-(^* - X{i))+Vs^\siA'A{X' -X{i 

+ ||^s.\s*-^*=IIf 

'=' \\{l-Vs^\s'^A*AVs^\s;){X'' - X{i)) 

< \\{i-T^s,\s:A*AVs^\s:)iX* - X{i))\\^ 

+ \\Vs^\s:A*AV^s^\s,-^^{X* - X(i))||^ + \\Vs,\s;A*e 



Vs^\S^A*e\\p 
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Fig. 12 Reconstruction performance in image denoising settings. The image size is 256 x 256 and the desired rank is preset to fc = 30. We 
observe 33% of the pixels of the best rank-30 approximation of the image. We depict the median reconstruction with respect to the best rank-SO 
approximation in dB over 10 Monte Carlo realizations 



< S3k\\X* 



^«IIf + Ps.\s*-^*^IIf 



+ ||p5,\5.^*>tP(5^\5.)^(X* - X{i 
^< S3k\\X*-X{{)\\^ 



Vs^\S^A*e\\p 



< 2<53fc||X*-X(i)||^- 



PsAsr-^'HIj. 



(40) 



where (i) due to triangle inequality over Frobenius metric norm, (ii) 
since VsiXS' {X{i) - X*) = 0, {Hi) by using the fact that X(i) - 
X" ■■= Vs^s: (X{i) - X*) + P(5^\5.)_L(X(i) - X*), (iv) due 



to Lemma SI (v) due to Lemma | 

X{i))\\p<\\X{i)-X*\\^ 



Island ( 



vi) since rP 



(SiW) 



..^(X* 



For the right hand side of \39) , we calculate: 

>||Ps.\s,(X*-X(i))||^-252fc||x(i)-X*||^ 

by using LemmasHandB] Combining Bol and ET) in l |39| , we get: 

\\rx*\s,X''\\p < (2S2k + 2S3k)\\X{i) - x*\\p 



+ ^2{1 + 52k) 
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A. 2 Proof of Theorem [T] 

Let X* <— VkiX*) be a set of orthonormal, rank-1 matrices that span 
the range of X*. In Algorithm l,W{i) ^ ^^(^(1)). Thus: 

\\W{t)-V{i)\\l<\\X*-V{i)\\l^ 
\\W{i) -X*+X'- V{i)\\l < \\X* - V{i)\\l ^ 



and thus: 



|w^(j)-x*|r < 2{w{i)-x'',v{i)- X*) 



(42) 



From Algorithm 1, i) V{i) G span(5i), ii) X{i) g span(5i) 
and Hi) W{i) e span(5i). We define £ •(— ortho(5i U X") where 
rank(span(£')) < 3fc and let Vg be the orthogonal projection onto the 
subspace defined by £. 

Since W{i) - X* e span(f ) and V{i) - X* e span(f ), the 
following hold true: 

W(i) - X* = re(W{i) - X*) and 
V{i)-X* =V£(V{i)-X'). 



Then, J42f can be written as: 

\W{t) - X'Wl < 2(Ve{W{i) - X*),V£{Vii) - X*)) ^ 

= 2{V£{W{i) - X*),V£{X{i) -X* - ^nVs^A'AiXii) - X*))) 



\\{l- ,l,Vs,A*AVs,)V£{X{^) - X*)\\p 

<-^\\V£{Xi^)^X*)\\^. 

Furthermore, according to LemmaB] 

\\Vs^A''AVs^V£iXii) - xn\\p < Ssk\\Vs^V£{X{t) - X')\\p 

since rank(7'K:X) < 3A:, VX e M™x" for/C ^ ortho(£'U5i). Since 
Vs±V£{X{i) - X") = Vx*\(T>,ux,)X* where 

V,^Vk(v^^^Vf{X{i))y 

then: 

\\Vs^V£{X{t) ~ X')\\^ = \\Vx'\iT,^ux^)X*\\^ 
< {252k + 253k)\\X{i) - X'Wp + ^2{1 + 52k)\\e\\^, 
using Lemma[6] Combining the above in l|45f, we compute: 



A< 



^''' +(25,, + 253.):''^'' 



■1-S2k 

I^«-^1If 



1-S2 



\W{i) - X*\ 



25, 



1-^2 



\W{i) - X*\\^^/2{1 + 52k)\\e\\ 



+ 2M'Ps{Vi'(i) - X*),V£Vs,{A*e)) 



In B, we observe: 



B := 2fi,{V£(W{i) - X*),r£VsM*^)) 
= 2fi,{W{'i) - X'-,Vs,{A'-e)) 

^< 2f,,\\Wii) - x-lUvsM'^^F 
< 2^1,^1 + S2k\\W{i) - X*\\\\e\L 



(43) 



(46) 



< 



Combining l |44[ l and ^46^ in ^43 1 , we get 

2(53 fc 



W(i)-X*\\^ 
452 fc 



1 - (52fc 

/ 2Vl + 52fc 
V l-'52fc 



+ (252fc + 253fc) 



25; 



1 — S2k 

""= V2(l + 520 



x{i)-x*\\^ 



(47) 



(44) 



where (i) holds since Ps^Pf = VeVs^ = Vs^ for span(5i) e 
span(£'), (ii) is due to Cauchy-Schwarz inequality and, {Hi) is eas- 
ily derived using Lemma|2] 

In A, we perform the following motions: 



Focusing on steps 5 and 6 of Algorithm 1, we perform similar 
motions to obtain: 

•l + 252fe^ 



||x(. + i)-x1U<(i±^)||w(.)-x1 



I -5k 



(48) 



A := 2{W{i) - X*,V£{X(i) - X*) - pL^Vs^A* AV£{X{i) - X*)) Combining the recursions in J47j and ((48), we finally compute: 



^^ 2{W{i) - X' ,V£{X{i) - X") 

- i,,rs,A*A[Vs,+Vsi-]V£{X{i) - X*)) 

= 2(W(i) - X\ (I - ii,Vs,A*AVs,)V£{X(i) - X*)) 

- 2tJi,{W{i) ~ XWs^A* AVg_LV£{X{i) - X*)) 

< 2\\w{i) - x*\\p\\{i-ti,rs,A*Ars,)r£{x{i) - x*)||^ 

+ 2,i,\\W{i) - X*\\^\\Vs,A*AVg±V£{X{i) - X*)\\p (45) 

where (i) is due to V£{X{i) - X*) := rSiV£{X{i) - X*) + 
'Pqi_ V£ (X (i) — X* ) and (ii) follows from Cauchy-Schwarz inequal- 



||X(i -hi)- X*||^ < p||X(i) - X*||^ -h7||e||2 
for P := (^^) (,#|f^ + (252. + 253.)32^ 
■l + 2&2k\(•2^/T+&^. , 253fc 



and 



+ 



l-'52fc 

l-<5fc 



)(^ 



-<52 



l-<52 



V2(l + '52fc)) 



ity. Since -^-k — 



< Hi < 



x{l-HiVs,A*APsi) e 



, Lemmaptlimplies: 



-, _ 1 - ^2k 1 + ^2fc _ , 



For the convergence parameter p, further compute: 
(V^)(^ + (2^.. + 253.)^) 

^ 1 - 021. / V 1 - 52)!. 1 - 52fc / 



1 + <52fc 1 - <52fc 



^ l + 253fe , 



^Ik) 



(49) 



2^2fc 
1 - &2k 



for 5fc < 52fc < 53fe. Calculating the roots of this expression, we easily 
observe that p < p < 1 for 534. < 0.1235. 
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A. 3 Proof of Theorem |2] 

Before we present the proof of Theorem[2] we list a series of lemmas 
that correspond to the motions Algorithm 2 performs. 



Lemma 9 [Error norm reduction via least-squares optimization] Let 
Si be a set of orthonormal, rank-1 matrices that span a rank-2k sub- 
space in K 



Then, the least squares solution V{i) given by: 



V{i) -h- argmin ||j/ — .4.V|| , 

V:Vespan{Si) 



(50) 



satisfies: 



\V{i)~X* 



1 



Vl+^2fc || 
1 - ^.^fc 



\\Vs^{V{i)~X*] 



(51) 



Proof We observe that || V(«) — X*\\ is decomposed as follows: 

\\v{i) - x*\\l = \\VsAvi^ - xn\\l + \\rs±{v{i) - x^Wl- 

(52) 

In |50l, V{i) is the minimizer over the low-rank subspace spanned by 
Si with rank(span(5i)) < 2k. Using the optimality condition (Lemma 
Ut over the convex set = {X : span(X) e cS^}, we have: 



{AV(i)-y,AVsAV{i) - ^*)> < 0. 



(53) 



for Vs^X* e span(c5 
right hand side of (J52ll 



Si). Given condition d53|, the first term on the 
becomes: 



\\Vs,{V{i)-X*)\\'^ 

= {V{i)-X*,Vs^{V{i)-X*)) 

{53} 

V(v(i)-x*,ps,(v(0-x*)> 

- {AV{t} - y,AVs,{V{i) ~ X*)) 

< \{V{i) - X*, (I - A*A)VsAV{i) - X*))\ 

+ (e,AVs,{V{i)-X*)) 



(54) 



Focusing on the term |(V(j) - X* , (I - A*A)Vs,{V{i) - X*))\, 
we derive the following: 

\{V{i) - X\{I - A'A)Vs,{V{i) ~ X*))\ 

= \iy(i) - x\vs^{v(:i) ~ X*)) 
- lyiS) - x'^A'AVs, (v(i) - x*))\ 



(i) 



ICPs^ux-iVii) - X*),Vs,{V{i) - X*)> 

- (AVs^ux'iVii) - X'-),AVs,{V{i) - X*)>| 

=' K-Ps^ux' {V{i) - X*),Vs,ux'Vs,{V{i) - X*)) 

- {AVs.ux- (V(i) - X'-),AVs,ux'VsAV{i) - ^*)>l 

= \{V{i) - X*,{I - Vs,ux'A*AVs,ux')Vs,{V{i) - X*))| 

where (i) follows from the facts that V{i) — X* 6 span(ortho(5i U 
A'*))andthusP5^uA->(V^(i)-^*) = Vii^-X* and (ii) is due to 
T-'SiUX'T-'s- = Vsi since span(5i) C span(ortho(5i U X*)). Then, 



\5A\ becomes: 

\\VsdV(i)~X*)\\l 

< \{V{i) - X*, (I - Vs,uX'A*AVs,uX')VsAV{i) -X*)}\ 
+ {e,AVs,{V{i)-X*)) 

< \\V{i) - X'\\p\\{I-Vs,uX'A*AVs,uX')VsAV{i) ~ X* 



\\Vs,A*e\\\\Vs,{V(i) - X*)\ 



(") 



< S3k\\VsAV{i) - X*)\\JV{i) ~ x'l 
+ Vl + S^\\Vs,{V{i) - X*)||^||e||2, 



(55) 



where (j) comes from Cauchy-Swartz inequality and (ii) is due to 
Lemmas 12] and [4] Simplifying the above quadratic expression, we ob- 
tain: 



\\VsdV{i) - X*)\\p < S3k\\V({) - X*\\^ + ^l + 52k\\e\\ 



As a consequence, J52ll can be upper bounded by: 



(56) 



V{i) ~ X*||^ < (53fe||V(i) - X*||^ -h Vl + 52fe||e|y^ 



(57) 



We form the quadratic polynomial for this inequality assuming 
as unknown variable the quantity 11 V^(i) — X* 11 Bounding by the 
largest root of the resulting polynomial, we get: 



|v(i)-x*|L < 



1 



l-<53fc "^"2 



\Ps^(y{i)-x*) 



(58) 



The following Lemma characterizes how subspace pranmg affects 
the recovered energy: 

Lemma 10 {Best rank-k subspace selection] Let V{i) e M."^^" be 
a rank-2k proxy matrix in the subspace spanned by Si and let X{i -\- 
1) <— 7-'fc(V(i)) denote the best rank-k approximation to V{i), ac- 
cording to (IjI. Then: 

\\X{i + 1) - V(i)||^ < \\VsAV{i) - X*)\\p < \\V{i) - x*\\^. 

(59) 

Proof Since X(j -|- 1) denotes the best rank-fc approximation to V{i), 
the following inequality holds for any rank-fc matrix X g jjmxn jjj 
the subspace spanned by Si, i.e. VX e span(cSi): 



\\X{i-\-l)~V{i)\\ < \\X -V{i 



(60) 



Since Vs-^ii) = ^ (0. the left inequality in j59[ is satisfied for 
X :=-P5',X* in J60f. 

Lemma 11 Let V{i) he the least squares solution in Step 2 of the 
ADMiRA algorithm and let X{i-\-l) beaproxy, rank-k matrix to V{i) 
according to: X{i + 1) <- VkiVii)). Then, |jx(j + 1) -X*||^ can 
be expressed in terms of the distance from Vii) to X* as follows: 



||X(i + 1) - X*||^ < ^1 + ^Sl^\\V{i) - x*||^ 

/3(l-h52fc)|| 



(61) 
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Proof We observe the following 

\\X{i + 1) - x'Wl = \\X{i + 1) - V{i) + V{i) - x*\\l 

= \\V{€) - X% + \\V(i) - X{i+l)\\l 

^2{V(i)-X'',V{i)-X{i + l)). (62) 

Focusing on the right hand side of expression \62\ , {V{i)—X*, V{i)~ 
X{i + 1)) = {V{i) - X*,Vs,{V{i) -X(T+1))) can be similarly 
analysed as in Lemma 10 where we obtain the following expression: 

\(Vii)-X*,Ps,iV{i)-X{i+im 
<S3k\\V{i)-X*\\^\\V{i)-X{i + l)\\^ 



+ V'^ + ^2k\\V{i) - X{i + l)\\^\\e\\^. (63) 

Now, expression (|62[l can be further transformed as: 



'*l|2 



(i) 



'*l|2 



+ 2(53fc||V« - X'\\^\\V{i) - X(i+ 1)11^ 
+ Vl + 52fc||V(i)-X(j + l)||^||e|y (64) 

where ( j) is due to j63^ . Using LemmaflO] we further have: 

\\X{^ + 1) - X*\\l < \\V{i) - X*||^ + \\VsdV{i) - X*)\\l 

+ 2(s3k\\V{i) - X'W^WVs^iVii) - X')]]^ 



+ Vl + S^\\Vs,{V{t) - A:*)||^||e|| 



(65) 



Furthermore, replacing llPg. (X* — V(i))|| with its upper bound 
defined in l|56|, we get: 



||X(i + l) - X* 
^^ (^ + 3Sh]{\\V{i)-X* 



i + s^l. ' 



(66) 



where (i) is obtained by completing the squares and eliminating nega- 
tive terms. 

Applying basic algebra tools in j61^ and ^51^, we get: 



|x(i + i)-x*|L < 



1+3^1 



\\Vs±{V{i)-X*) 



1 + 351,, 



A.4 Proof of Theorem [3 

Let X* •(— Pfe(X*) be a set of orthonormal, rank-1 matrices that span 
the range of X*. In Algorithm 3, X{i + 1) is the best rank-A: approx- 
imation ofV{i). Thus: 



\\X{i + l)-V{i)\\l.<\\X*-V{i)\\l^ 
\\Xii + l)-X*\\'i < 2(X{i+l)~X*,V{i)-X'-) 



(68) 



From Algorithm 3, i) V{i) e span(iSi), ii) Q, £ span(iSi) and 
m) W{i) g span(cSi). We define £ -f— ortho(iSi U X") where we 
observe rank(span(£')) < 4k and let Vg be the orthogonal projection 
onto the subspace defined by S. 

Since X{i + 1)- X* e span(£') and V{i) - X* e span(£:), the 
following hold true: 

X{i +1)-X* = V£{X{i + 1)- X*), 

and, 

V(i)-X* =V£{V{i)-X'-). 

Then, \6S) can be written as: 

||x(* + i)-x-||^ 

< 2{V£iX{i + I) - X*),V£{V{i) - X*)) 

= 2{V£{X{i -hi)- X*),V£ {Q.+t^^Vs^A'AiX* - QJ - X*)) 

= 2(X(i-h 1) - X*,V£{Q, - X*) 
-f,,Vs,A'A[Vs, +Ps^]V£{Q,-X*)) 

= 2(X(i + 1)- X*, (I - ^J.^Vs,A'■APsJV£{Q, - X*)> 
- 2tM,{X{i+l) - X*,Vs^A*APg_LV£{Q, - X*)) 

< 2||x(i-h 1) - ^ij,||(I - f^,rs,A''AVs,)V£iQ, - X*)\\p 
+ 2ii,\\x{i + l)-X*\\p\\Vs,A*AVs±V£{Q,~X'-)\\^ (70) 

where(i)isduetoP£(Q,-X*) ■.= Vs^V£{Q^-X*)+Vg_LV£{Q^- 
X*) and (ii) follows from Cauchy-Schwarz inequality. Since -^^ — ^ 
fii < Y^s — , LemmaWlimplies: 



X{I-fj.iVs,A*AVs,) e 



l-S^k 1 + 5., 



1 -h 5.-,fc 1 - 5, 



253fc 
1 - 5., 



+ [^, -, -hV^K/l + 52fe||e|L. 

Since V(i) e span(5i), weobserve-p^i (V(i)-X*) = -Vq^lX* 
-'Px*\('Di\jXi)X'' ■ Then, using LemmalST we obtain: 

|X(i + l)-X*||^ 



X(i)| 



and thus: 



I (I - tJi:Vs,A'-APs,)V£{Q, -X')\\p 



< 



V£{Q,-X*)\ 



[2&2k + 2531 


1 ^-sik 


l^ + ^^ik 




V2{l + 53k) 



+ 



+ ( —, 7 + v^l Vl + 52fe 



1 - 53fc ' 
Furthermore, according to LemmaB] 

\\Vs^A*AV^^V£{Q, - xn\\p < S^kWVs^VeiQ, - X*)||^ 

since rank('Px:Q) < 4A;, VQ G R"ixn .^^,here K *r- or\ho{£ U Si). 
SmceV^±V£{Qi - X*) = Vx*\cd,uxoX* where 



(67) 



Given 52k < 53^^, p is upper bounded by p < 4(53fe^/-| 



+ 3«3 



then: 

\\Vs^V£{Q,-X*)\\^ = \\V, 



1+3^31. 



Then, 453fe /i±:^ < I ^ S^^ < 0.2267. 



+ 254fc)||Q, -X* 



^*|L< (253fe 



(71) 
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9{i + l)< 



a{l + n) + y^Y+^ /Q(l + r,)-V^V + ' 






(fcl+fe2)( 



i{l + n) + ^/AY 



|X(0)-X*||^ 



\X(0)~X' 



(69) 



using Lemma[6] Using the above in ( |70[ l, we compute 

||X(i + l) ~ X* 
< 



If 
4<53fc 

1-.53* 



■ (253fc + 254k) 



253k 



1-53 



Q, -X* 



If 
(72) 



Furthennore: 



iQi - X* 11 = ||x(i) + n(X{i) - X{i - 1))| 



(l + T,)iX{i)-X*)+n{X*-X{i-l))\\^ 

F 

(73) 



< (1 + r,)||X(i) - X*||^ + n\\X{i -1)- X*\\^ 



Combining p2\ and |73), we get: 

||X(* + 1)-X*||^ 

<{l + n)(.^ + {253k+25,k) ^''' 



1-5. 



3k 



\X{i) - X*\ 



+ n{.^ + {253k+25,k).^ 



1 - ^Sk' 

X{i- 1) - x*\ 



(74) 



j#fi^+(2<53fc + 254fc)3^andg(i) := \\X{i + l) 



Let Q 

X* II . Then, |74| defines the following homogeneous recurrence: 

git + 1) - q(1 + n)g{i) + ang{i - 1) < (75) 

Using the method of characteristic roots to solve the above recurrence, 
we assume that the homogeneous linear recursion has soluti on o f the 
fonn g{i) = r* for r 6 M. Thus, replacing g{i) = r'- in J75| and 
factoring out r^^~^\we form the following characteristic polynomial: 



r — a(l + Ti)r — ar^ < 



(76) 



Focusing on the worst case where ^6^ is satisfied with equality, we 
compute the roots ri.2 of the quadratic characteristic polynomial as: 

n.. = "^^ + ";^^^ , where ^:=a^(l + r0^+4»r,. 

Then, as a general solution, v ve co mbine the above roots with unknown 
coefficients 6i, b2 to obtain l |69| . Using the initial condition g{0) := 
\\X{0) - X*\\p x(02=o ||j^*||^ ^ -^^ ^g gg( f,^ ^ f,2 = 1. Thus, 
we conclude to the following recurrence: 



A. 5 Proof of Lemma |2] 

Let©,' ^ Vf^{Vx±\'f{X{i)))andV, ^ Pfc(P^j_ V/(X(i))).Using 
Definition!?] the following holds true: 

||Pi,.V/(X(*))-V/(X(i))||^ 

< {l + e)\\rT,^yf(,X(i))-Vf(X{i))\\l. ill) 



Furthermore, we observe: 



||V/(X«)||^ = ||V/(X(i))||^^ 
||7'i,.V/(X(i))||^ + ||P(i,.)^ V/(X(i))||^ = 

\\V;^,\X^^f{X(i))\\l + ||P(^.^^^)^ V/(X(i)) 



(78) 



Here, we use the notation defined in the proof of Lemma 6. Since 
P-p. V/(X(i)) is the best rank-fe approximation to V/(X(i)), we 
have: 



||Pi,.V/(X(i))-V/(X(i))||^ 

||P;,.\;,,V/(X(i))-V/(X(i)) 



|^<||7'(;,.\;,^)^V/(X(i))||^^ 



||7'^_LV/(X(i) 

(l + .)||P^^V/(X(i))||^ < (1 + €)||P(;,..;,.^V/(X(-^"' 



F 

(79) 



where rank(span(ortho(A" \ Xi))) < k. Using J77j in |79), the fol- 
lowing series of inequalities are observed: 



||P(j,..^V/(X(i))||^<(l + e)||7'j,^V/(X(i))"' 






Now, in Its), we compute the series of inequalities in (8Tl-([82l. Fo- 



cusing on 



x*\x, 



A*{v-AX{i 



, we observe: 



11^. 
IIP, 



(;,.\;,^)^^*(^X*+£-^X(i))||^ 



(;,.\;,,)^^*AX*-X(i))||^- 



X'\Xi 



\A*e\\j^ < 2X 



\\A'AiX*-X{i))\\^- 

Moreover, we know the following hold true from Lemma[6] 

||p5,ys.^*^(X* - Xii)) + Vs^\s:A-e\\^ 

< 253k\\X* - X(»)||^ + \\Vs^^s^A*e\\^ 

and 



(83) 



(84) 



||p5.\S^>t*^(X* - X{i))+Vs^\sA''4F 

>||7's*\s.(X*-X(i))||^-252fc||X«-X*||^ 

- \\rs^\s,A*e\\p (85) 

Combining |83l-([85j in l |82| , we obtain: 

|Ps*\Si^*||^ = |PA:*\Si^*||j7 

< (2<52fc + 253fe)||X(i)-X' 

+ 2Xy/l 



-hV2(l + 52fe)||e|| 
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||P^.V/(X(i))||^ 

||P^.V/(X(i))||^ + (l + e) 

||Pc.V/(X(i))||^ + . 

||Pi,.V/(X(i))||!, + ||P;f,V/(X(-^^"^ 



P(^.)_LV/(X(i 



\x*\x,)^ 
\x*\x,)^ 



(x*\xo-^ 
\x'\x,)^ 



Vf(X{i 
Vf{X{i 
Vf{X{i 
Vf{X{i 



V/(X(i 






\x'\x,)^^f{X{i)) 
\Vx*\x^yf(X{-0)f 



,UII^. — ' 

\l> 
\l> 
\l> 
\l> 






(81) 



F 
|P;,.\;,,V/(X(i))||^ + \\Vx,^f{X{i))\\l 

2 



|P5.V/(X(i))||l«» 



|p5.\5,V/(X(i 

WVs-^xs.A-'iy-AXi 
\Vsi;\s,A*{y~AX{ 



(82) 



252fe 



ll^w 



I V(.) - Xi, < ^(l + ^) (2.., + 2.3. + 4)) + ^_ ^^^ 

+ f(l + ^)y2(lT^+ ^^i±^l ||s||, + (1 + -^)2Av/J. 

L I -02k 1 - 02fc J -^ 1 - 02fc 



(86) 



A. 6 Proof of Theorem |4] 

To prove Theorem [4] we combine the following series of lemmas for 
each step of Algorithm 1 . 

Lemma 12 [Error norm reduction via gradient descent] Let Si <— 
ortho{Xi U Vl) be a set of orthonormal, rank-1 matrices that span a 
rank-2k subspace in K™^". Then \86\ holds. 

Proof We observe the following: 

\\v(i) - x*\\i = \\ps,{yi') - ^*)IIf + \\T^s^iv{i) - X* 



(87) 



The following equations hold true 

\\Vs^(V{i)^X*)\\l = 
Furthermore, we compute 



^s^^*\\l = Px'\s^^*\\l 



\\Vs,(V{i) - X')\\^ = WVs^XiS) - fVs,^f{X{i)) - x*)||^ 

= ||Ps. (X{i) - X*) - ti,Vs,A*A{X{i) - X*) + f,,rs,A*e\\j, 

<\\{l-f^^Vs,A''AVs,-PsAX{i)-X')\\p 

+ li,\\Ps,A'APg±iX{i) - X*)\\p + li^Wrs.A'eWp 



W _2<52fc II , , , *,|| 



l-<5. 



2fe 



1 - S-2k 



^ Vl + 52fc|| II 

+ ^r X r 2 

1 - dat " "2 



V^^iX{i)-X*) 



(88) 



where (i) is due to Lemmasplwlpland j— ^ — 1^ P-i < jttj 

Using the subadditivity property of the square root in j87 
Lemmapjand the fact that \\rs,{X{i) - X*)\\p <\\X{i) 
we obtain: 

||v(i) - x*||^ < \\VsAv{i) - x*)\\^ + \\Vs^{v{V) - x')\\^ 

&3k 



<~p\\X{{)-X*\\^ 
.... <5: 



1 



)A\Tk-\x.^*4F 



1 - <52fc 

VI + 52fc 



[(1 + ^)^2(1 + 5,,) + ^ 

L i — d2k i 

where p := {l + 3^) (2,52, + 2.3^) + ^ 



(89) 



2^2fc 



We exploit Lemma[8]to obtain the following inequalities: 



\W, 



\W, ~V(i) + V{i?)- x*\ 



<\\W, -V{i)\\^ + \\V{i) - x*\\^ 

) 

< (2 + €)||V(i)-X* 



< (1 + e)\\W(i) - V(i)||^ + \\V{i) - X'W^ 



(90) 



where the last inequality holds since W(i) is the best rank-fc matrix 
estimate of V(i) and, thus, \\W{i) - V(i)\\p < \\V{i) -X*||^. 

Following similar motions for steps 6 and 7 in Matrix ALPS I, we 
obtain: 



\\X{i + 1) - X*\ 



< 1- 
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1 



w.-x* 






(91) 



Combining ^91^, l|90| and J89l, we obtain the desired inequality. 
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